GraphLab

GraphLab collaborative filtering library: efficient probabilistic matrix/tensor factorization on multicore

This webpage explains how to use the GraphLab collaborative filtering library. The library implements multiple matrix decomposition algorithms, described in the following papers:
Probabilistic matrix/tensor factorization:
A) Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G. Carbonell, Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proceedings of SIAM Data Mining, 2010. html (source code is also available).

B) Salakhutdinov and Mnih, Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. In International Conference on Machine Learning, 2008. pdf, project website. It is listed here because our code also implements matrix factorization as a special case of tensor factorization.

C) Alternating least squares: Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008. pdf

D) SVD++ algorithm: Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 426-434. ACM, 2008. http://portal.acm.org/citation.cfm?id=1401890.1401944

E) SGD (stochastic gradient descent) algorithm: Yehuda Koren, Robert Bell, Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. In IEEE Computer, Vol. 42, No. 8. (07 August 2009), pp. 30-37.
F) Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.

G) For Lanczos algorithm (SVD) see: wikipedia.

H) For NMF (non-negative matrix factorization) see: Lee, D.D., and Seung, H.S., (2001), 'Algorithms for Non-negative Matrix Factorization', Adv. Neural Info. Proc. Syst. 13, 556-562.

I) For Weighted-Alternating least squares: Collaborative Filtering for Implicit Feedback Datasets Hu, Y.; Koren, Y.; Volinsky, C. IEEE International Conference on Data Mining (ICDM 2008), IEEE (2008).
J) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 502-511.

K) For sparse factor matrices see: Xi Chen, Yanjun Qi, Bing Bai, Qihang Lin and Jaime Carbonell. Sparse Latent Semantic Analysis. In SIAM International Conference on Data Mining (SDM), 2011.

D. Needell, J. A. Tropp CoSaMP: Iterative signal recovery from incomplete and inaccurate samples Applied and Computational Harmonic Analysis, Vol. 26, No. 3. (17 Apr 2008), pp. 301-321.

L) For SVD see Wikipedia

M) For time-SVD++, see Yehuda Koren. 2009. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09). ACM, New York, NY, USA, 447-456. DOI=10.1145/1557019.1557072


N) For bias-SVD
Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Equation (5), pdf.

O) For RBM:
G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. University of Toronto Tech report UTML TR 2010-003 pdf.


The GraphLab collaborative filtering library was written by Danny Bickson. The goal is to factorize a user/item matrix into two lower-dimensional matrices. In other words, we build a linear model of the data, which can later be used to predict ratings for user/item pairs not seen before.
Please post usage questions and bug reports in our Google group.
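In symbols, the linear model can be sketched as follows. The notation is ours and chosen only for illustration: $R$ is the M x N user/item rating matrix, $U$ and $V$ are the two factor matrices with $D$ columns each, and $T$ is the additional time factor matrix used in the tensor case.

```latex
R \approx U V^{\top}, \qquad U \in \mathbb{R}^{M \times D}, \quad V \in \mathbb{R}^{N \times D}

\hat{r}_{ij} = \sum_{d=1}^{D} U_{id} V_{jd}
\qquad \text{(matrix case)}

\hat{x}_{ijk} = \sum_{d=1}^{D} U_{id} V_{jd} T_{kd}
\qquad \text{(tensor case)}
```

The individual algorithms listed above differ in how they fit $U$, $V$ and $T$; see the cited papers for the exact objectives.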

News

  • 18 May 2012: The first GraphLab workshop is coming! More than 80 participants so far from about 50 companies and 12 universities. Mark your calendar: July 9 in San Francisco.
  • 15 Dec 2011: GraphLab receives additional NSF grant of 200,000 cpu hours on BlackLight/Kraken supercomputers. read more.
  • 11 Nov 2011: time-SVD++ is now implemented as part of the GraphLab collaborative filtering library. read more.
  • 1 Oct 2011: GraphLab now supports Eigen linear algebra package. read more.
  • 23 Aug 2011: released an optimized version for the BlackLight supercomputer, which runs 5 times faster! read more.
  • 21 Aug 2011: Our paper "Efficient multicore collaborative filtering" appeared today at the ACM KDD CUP workshop 2011. pptx slides.
  • 16 June 2011: GraphLab based matrix factorization code finished in 5th place (out of more than 1000 research groups) in Yahoo! KDD CUP 2011. read more.
  • 6 June, 2011: more than 300 unique installations of the GraphLab collaborative filtering library!
  • 5 April, 2011: Updated instructions on how to install and run Graphlab matrix factorization with Yahoo! KDD CUP data are found here

    Target

    This code provides a highly efficient implementation of matrix/tensor factorization on multicore machines. So far it has been tested with a tensor of size 16M x 16M x 100 with 1,000,000,000 non-zero entries, on up to 32 cores.

    Requirements

    Program input

    The GraphLab collaborative filtering library takes three input files: training, validation and test. Only the training input is mandatory; the validation and test inputs are optional. The training file is used to train the model on observed user-item ratings, the validation file is used to assess the quality of the trained model, and the test file lists the unobserved ratings to predict. Naming convention: if the training file is called foobar, the validation file should be called foobare and the test file foobart.
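As an illustration of the naming convention, here is a minimal Python sketch that derives the validation and test file names from the training file name and reports which of them exist. The helper names (dataset_files, available_files) are ours and are not part of GraphLab.

```python
import os


def dataset_files(training_path):
    """Return the file names implied by the suffix convention.

    Validation is the training name plus "e", test is the training name plus "t".
    """
    return {
        "training": training_path,
        "validation": training_path + "e",
        "test": training_path + "t",
    }


def available_files(training_path):
    """Keep only the files that exist; the training file is mandatory."""
    files = dataset_files(training_path)
    if not os.path.isfile(files["training"]):
        raise FileNotFoundError("training file not found: " + files["training"])
    return {role: path for role, path in files.items() if os.path.isfile(path)}


print(dataset_files("foobar"))
```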
    The input is a sparse tensor (or matrix) prepared in one of the following ways:

    Matrix Market input files

    GraphLab supports sparse Matrix Market input files. When you use this format, don't forget to pass the flag --matrixmarket=true when running GraphLab.
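For readers who have not used the format, the following Python sketch writes a list of (user, item, rating) triples as a standard Matrix Market coordinate file. It covers the plain matrix case only; how time bins or weights are encoded for GraphLab is not specified here and is not attempted. The function name write_matrix_market is ours, and ids are assumed to be 1-based as the Matrix Market format requires.

```python
def write_matrix_market(path, ratings, num_users, num_items):
    """Write (user, item, rating) triples as a Matrix Market coordinate file.

    User and item ids are 1-based, as required by the Matrix Market format.
    """
    with open(path, "w") as f:
        # Banner line, then the size line: rows, columns, number of non-zeros.
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (num_users, num_items, len(ratings)))
        for user, item, rating in ratings:
            if not (1 <= user <= num_users and 1 <= item <= num_items):
                raise ValueError("id out of range: (%d, %d)" % (user, item))
            f.write("%d %d %s\n" % (user, item, repr(float(rating))))
```

For example, write_matrix_market("toy_mm", [(1, 2, 4.0), (3, 1, 5.0)], 3, 2) produces a 3 x 2 matrix with two observed ratings.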

    Matlab/Octave input

    There are several ways to prepare the program input. Perhaps the easiest is to use the save_c_gl_mat.m script. The input to the script is a matrix of size NumOfRatings x 4.
    Each row has the following format: [ from user ] [ to movie ] [ time ] [ rating ].
    Note: when weighted alternating least squares is used, [time] can instead hold the [weight/confidence in ratings].
    Note: as in Matlab, user ids, movie ids and time bins all start from 1.
    Another useful script is prepare_dataset.m
    It takes as input a text file of the row format [user#] [movie#] [rating]
    or
    [user#] [movie#] [time/weight] [ rating]
    and converts it into GraphLab PMF format.

    For example, the following Matlab code generates a random 5x12 matrix and creates two GraphLab PMF inputs (training and validation) by randomly splitting the data, 90% for training and the rest for validation:
    A=rand(5,12);
    [a,b,c]=find(A);
    A = [a b c];
    save -ascii tempA.txt A;
    prepare_dataset('tempA.txt', 'svd5x12', .90);
    The result is the file svd5x12 (training) and svd5x12e (validation).
    Note: when generating the validation file, make sure it has the same number of users and movies as the training file.
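The same kind of split can be sketched in Python. This is not a port of prepare_dataset.m; it only illustrates the idea of a random 90/10 split in which the matrix dimensions are computed once, on the full data, so that the training and validation files can declare the same number of users and movies. The function names are ours.

```python
import random


def dimensions(rows):
    """Largest user id and movie id over all (user, movie, rating) rows (1-based ids)."""
    return max(r[0] for r in rows), max(r[1] for r in rows)


def split_ratings(rows, train_fraction=0.9, seed=0):
    """Randomly split rating rows into a training part and a validation part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data: a dense 5 x 12 matrix of random ratings, as in the Matlab example.
data_rng = random.Random(1)
rows = [(u, m, data_rng.random()) for u in range(1, 6) for m in range(1, 13)]
num_users, num_movies = dimensions(rows)  # computed before splitting
train, validation = split_ratings(rows, 0.9)
print(num_users, num_movies, len(train), len(validation))
```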
    A third useful Matlab script is convert2seq.m, which takes an adjacency list with N arbitrary integer node ids and translates it into consecutive node ids between 0 and N-1. For example, assume our graph is a network of connections between IP addresses, with 100 IP addresses and 200 connections between them. Since GraphLab uses a contiguous range of node ids, it is desirable to number the nodes 0-99 and keep a mapping from each node id back to its original IP address.
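The renumbering idea can be sketched in a few lines of Python. This is an illustration only, not the convert2seq.m script: we assume ids are assigned in order of first appearance, which may differ from what the Matlab script does.

```python
def to_consecutive_ids(edges):
    """Map arbitrary node labels to consecutive ids 0..N-1.

    Returns the renumbered edge list and a list `originals` such that
    originals[new_id] is the original label of that node.
    """
    id_of = {}
    originals = []

    def lookup(node):
        # Assign the next free id the first time a node is seen.
        if node not in id_of:
            id_of[node] = len(originals)
            originals.append(node)
        return id_of[node]

    renumbered = [(lookup(src), lookup(dst)) for src, dst in edges]
    return renumbered, originals


edges = [("10.0.0.7", "10.0.0.2"), ("10.0.0.2", "192.168.1.1")]
renumbered, originals = to_consecutive_ids(edges)
print(renumbered)
print(originals)
```

Keeping the originals list is what lets you translate results back to IP addresses after the run.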

    Python input

    If you prefer to use Python for preparing the inputs, an example Python script is found here.

    Mahout/Hadoop SequentialAccessSparseVector input files

    Use the following instructions for converting Mahout's SVD input files to GraphLab's.

    Program output

    The output of the program is three matrices U, V and T of size dim1 x D, dim2 x D and dim3 x D.
    The output is written to a file named [inputfile].out
    There are several supported formats for the program output.

    Matrix market output

    When using the flag --matrixmarket=true, the output matrices U and V will be generated using two sparse matrix market output files.
    You can load the files in Matlab/Octave using the script mmread.m.
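If you prefer Python, a coordinate-format Matrix Market file can be read with a few lines of code. The sketch below is a minimal reader written for illustration (the function name is ours); it handles only the "coordinate" layout with real values and ignores symmetry qualifiers. For anything beyond that, scipy.io.mmread is the more robust choice.

```python
def read_matrix_market_coordinate(path):
    """Read a Matrix Market coordinate file.

    Returns (num_rows, num_cols, entries), where entries maps 0-based
    (row, col) pairs to float values.
    """
    with open(path) as f:
        banner = f.readline().lower().split()
        if len(banner) < 3 or banner[0] != "%%matrixmarket" or banner[2] != "coordinate":
            raise ValueError("not a Matrix Market coordinate file")
        size_line = None
        for line in f:
            stripped = line.strip()
            if stripped and not stripped.startswith("%"):  # skip comments
                size_line = stripped
                break
        if size_line is None:
            raise ValueError("missing size line")
        num_rows, num_cols, _nnz = (int(x) for x in size_line.split())
        entries = {}
        for line in f:
            parts = line.split()
            if not parts:
                continue
            # Indices in the file are 1-based; convert to 0-based.
            entries[(int(parts[0]) - 1, int(parts[1]) - 1)] = float(parts[2])
    return num_rows, num_cols, entries
```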

    Matlab/Octave output

    You can read the program output in Matlab/Octave using the command:
    >> itload('netflix.out')
    >> whos
      Name        Size           Bytes  Class     Attributes
      Time       27x30            6480  double
      User    95526x30        22926240  double
      Movie    3561x30          854640  double
    To use this command, download itload here and save it in your release/demoapps/pmf/ working folder.

    Python (binary) output

    parse_graphlab_pmf.py is a script for reading the output in Python, contributed by Timmy Wilson (Smarttypes.org). Use the flag --binaryoutput=true to select this format.

    Computing predictions

    There are three possible ways to compute predictions for user/item pairs, detailed below.

    Using test input file

    Assume your training input file is mydataset; your test input file should then be called mydatasett (namely, a "t" is appended to the filename). The test file lists the user/item pairs to compute predictions for. It has exactly the same format as the training and validation files. (The rating value given in the test file is simply ignored.)
    At the end of the run you will get an output file with the scalar prediction computed for each user/item pair.

    Using glcluster program

    For computing the top K predictions you can use the glcluster program as follows.
    Assume you run pmf on a file called "chapters.mm". The result will be two files, chapters.mm.U and chapters.mm.V. To compute the recommended ratings, run:
    ln -s chapters.U chapters
    ln -s chapters.V chapterse
    ./glcluster chapters 8 3 0 --matrixmarket=true --training_ref=chapters.mm --ncpus=8
    The output of this command is:
    1) chapters.scalar-ratings.mtx
    2) chapters.recommended-items.mtx

    The file recommended-items includes the ids of the recommended items (starting from 0 and not from zero!) for each user, one user per row. The file scalar-ratings lists the computed scalar rating for each of the recommended items.
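To make the idea of "top K predictions" concrete, here is a small Python sketch that scores every item for one user from the factor matrices and keeps the K best. It is not the glcluster implementation; in particular, skipping items the user already rated in training is our assumption about what a recommender would normally do. The function and argument names are ours.

```python
import heapq


def top_k_items(user_factors, item_factors, rated_items, k):
    """Return the k highest-scoring (item_id, score) pairs for one user.

    user_factors: list of D floats (one row of U).
    item_factors: list of rows of V, one per item (0-based item ids).
    rated_items:  set of item ids to skip (assumed: items seen in training).
    """
    scores = []
    for item, factors in enumerate(item_factors):
        if item in rated_items:
            continue
        # Predicted rating is the inner product of the two factor vectors.
        score = sum(u * v for u, v in zip(user_factors, factors))
        scores.append((score, item))
    return [(item, score) for score, item in heapq.nlargest(k, scores)]


user = [1.0, 0.0]
items = [[0.5, 9.0], [2.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
print(top_k_items(user, items, rated_items={3}, k=2))
```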

    Using Matlab

    To compute the prediction for entry $x_{i,j,k}$ of the tensor, simply compute the product in Matlab:
    >> sum(User(i,:) .* Movie(j,:) .* Time(k,:))
    Alternatively, if this is a matrix, you can compute:
    >> sum(User(i,:) .* Movie(j,:))
    For example, if you want to predict the rating of user i=3 for movie j=8 you can do:
    >> sum(User(3,:) .* Movie(8,:))
    And you will get a scalar result that predicts the rating based on the linear model.
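The same computation in plain Python looks as follows. This is a direct transcription of the Matlab expressions above, with rows passed in as lists; the function name predict is ours. Keep in mind that Python indexing is 0-based, so Matlab's User(3,:) corresponds to row index 2 of a Python list of rows.

```python
def predict(user_row, movie_row, time_row=None):
    """Predicted rating: element-wise product of the factor rows, summed.

    With time_row omitted this is the matrix case, otherwise the tensor case.
    """
    if time_row is None:
        return sum(u * m for u, m in zip(user_row, movie_row))
    return sum(u * m * t for u, m, t in zip(user_row, movie_row, time_row))


print(predict([1.0, 2.0], [3.0, 4.0]))              # matrix case
print(predict([1.0, 2.0], [3.0, 4.0], [0.5, 2.0]))  # tensor case
```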

    Unit testing

    After installation, it is a good idea to run the unit tests first:
    cd release/tests
    ./runtests.sh 1
    You will see a report of all unit tests and their results. In case of any failure, please email the resulting output file stdout.log to danny.bickson@gmail.com .

    Running PMF

    Command line options

    After preparing the GraphLab input file using the save_c_gl_mat.m script, you should run:
    ./PMF [input file] [run mode] --scheduler="round_robin(max_iterations=XXX,block_size=1)"
    where XXX is the number of desired iterations.
    The following are the supported run modes:
    0 = Matrix factorization using alternating least squares
    1 = Matrix factorization using MCMC procedure
    2 = Tensor factorization using MCMC procedure, single edge exists between user and movies
    3 = Tensor factorization using MCMC procedure, with support for multiple edges between user and movies in different times
    4 = Tensor factorization using alternating least squares
    5 = SVD++ - Koren's SVD++ algorithm
    6 = SGD - stochastic gradient descent
    7 = SVD - Lanczos algorithm
    8 = NMF - non-negative matrix factorization algorithm of Lee and Seung
    9 = Weighted alternating least squares
    10 = Alternating least squares with sparse user factor matrix
    11 = Alternating least squares with sparse user and movie factor matrices
    12 = Alternating least squares with sparse movie factor matrix
    13 = SVD - singular value decomposition (via double Lanczos method)
    14 = time-SVD++ - Koren's method
    15 = bias-SGD, stochastic gradient descent with user and item bias
    16 = Restricted Boltzmann Machines (RBM)
    The following are optional parameters (short list). You can view the full list using the command "./pmf --help".
    --matrixmarket=true - for matrix market format
    --debug=true - display debug info
    --ncpus=XX - run with XX cpus
    --D=XX - feature vectors length (reasonable values 5-300)
    --lambda=XX - regularization parameter for matrices U and V (for ALS, default=1)
    --aggregatevalidation=true - use validation data for training
    --maxval=XX, --minval=XX - maximum and minimum allowed rating values; setting them is recommended to improve predictions
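When launching many runs from a script, it can help to assemble the command line programmatically. The Python sketch below builds the argument list from the pieces described above; it only uses flags that appear on this page, and the helper name pmf_command is ours. Passing the list to subprocess.run avoids shell quoting problems with the parentheses in the --scheduler value.

```python
def pmf_command(input_file, run_mode, iterations, flags=None, binary="./pmf"):
    """Build the pmf argument list: input file, run mode, scheduler, then flags.

    flags is a dict such as {"ncpus": 2, "matrixmarket": True}; booleans are
    rendered as true/false, matching the --flag=true style used above.
    """
    scheduler = "round_robin(max_iterations=%d,block_size=1)" % iterations
    args = [binary, input_file, str(run_mode), "--scheduler=" + scheduler]
    for name in sorted(flags or {}):
        value = flags[name]
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append("--%s=%s" % (name, value))
    return args


cmd = pmf_command("movielens_mm", 0, 10,
                  {"matrixmarket": True, "lambda": 0.065, "ncpus": 2})
print(" ".join(cmd))
# To actually run it (requires the compiled pmf binary in the current folder):
#   import subprocess; subprocess.run(cmd, check=True)
```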

    Running example 1 - Alternating matrix factorization

    1) Download the movielens and movielense sample files.
    The example dataset is the Movielens 1M ratings dataset. You can download files preprocessed for GraphLab format: movielens training file, movielens test file.

    2) Run the following command
    ./pmf movielens_mm 0 --scheduler="round_robin(max_iterations=10,block_size=1)" --matrixmarket=true --lambda=0.065 --ncpus=2
    You should see an output like:
    ./pmf movielens_mm 0 --scheduler="round_robin(max_iterations=10,block_size=1)" --matrixmarket=true --lambda=0.065 --ncpus=2
    INFO: pmf.cpp(do_main:434): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(do_main:441): Program compiled with it++ Support
    Setting run mode ALS_MATRIX (Alternating least squares)
    WARNING: pmf.h(verify_setup:410): It is recommended to set min and max allowed matrix values to improve prediction quality, using the flags --minval=XX, --maxval=XX
    INFO: pmf.cpp(start:285): ALS_MATRIX (Alternating least squares) starting
    loading data file movielens_mm
    Loading Matrix Market file movielens_mm TRAINING
    Loading movielens_mm TRAINING
    Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
    loading data file movielens_mme
    Loading Matrix Market file movielens_mme VALIDATION
    Loading movielens_mme VALIDATION
    Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
    loading data file movielens_mmt
    Loading Matrix Market file movielens_mmt TEST
    Loading movielens_mmt TEST
    skipping file
    setting regularization weight to 0.065
    INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.
    INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.
    ALS_MATRIX (Alternating least squares) for matrix (6040, 3952, 1):900000. D=20
    pU=0.065, pV=0.065, pT=1, D=20
    complete. Objective=6.17196e+06, TRAIN RMSE=3.7034 VALIDATION RMSE=3.7079.
    INFO: pmf.cpp(run_graphlab:232): starting with scheduler: round_robin
    max iterations = 10
    step = 1
    max_iterations = 10
    INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
    Entering last iter with 1
    total updates so far 9991
    1.18921) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.37071e+06, TRAIN RMSE=2.2903 VALIDATION RMSE=0.9347.
    Entering last iter with 2
    total updates so far 19984
    2.37481) Iter ALS_MATRIX (Alternating least squares) 2 Obj=572734, TRAIN RMSE=1.1217 VALIDATION RMSE=0.8934.
    Entering last iter with 3
    total updates so far 29974
    3.55853) Iter ALS_MATRIX (Alternating least squares) 3 Obj=404727, TRAIN RMSE=0.9425 VALIDATION RMSE=0.8690.
    Entering last iter with 4
    total updates so far 39968
    4.74525) Iter ALS_MATRIX (Alternating least squares) 4 Obj=338742, TRAIN RMSE=0.8624 VALIDATION RMSE=0.8605.
    Entering last iter with 5
    total updates so far 49960
    5.92972) Iter ALS_MATRIX (Alternating least squares) 5 Obj=307645, TRAIN RMSE=0.8221 VALIDATION RMSE=0.8562.
    Entering last iter with 6
    total updates so far 59951
    7.11545) Iter ALS_MATRIX (Alternating least squares) 6 Obj=290248, TRAIN RMSE=0.7988 VALIDATION RMSE=0.8536.
    Entering last iter with 7
    total updates so far 69943
    8.30341) Iter ALS_MATRIX (Alternating least squares) 7 Obj=279441, TRAIN RMSE=0.7840 VALIDATION RMSE=0.8519.
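To track convergence across runs, the per-iteration figures can be pulled out of the program output with a regular expression. The Python sketch below is based only on the log lines shown on this page (the "Iter ... Obj=..., TRAIN RMSE=... VALIDATION RMSE=..." pattern, with TEST in place of VALIDATION in older versions); the format may differ in other versions of the code, and the function name is ours.

```python
import re

# Matches e.g. "Iter ALS_MATRIX (Alternating least squares) 4 Obj=338742,
# TRAIN RMSE=0.8624 VALIDATION RMSE=0.8605."
ITER_RE = re.compile(
    r"Iter (?P<algo>.+?) (?P<iter>\d+) Obj=(?P<obj>[0-9.eE+-]+), "
    r"TRAIN RMSE=(?P<train>\d+\.\d+) "
    r"(?:VALIDATION|TEST) RMSE=(?P<held_out>\d+\.\d+|nan)"
)


def parse_iterations(log_text):
    """Return a list of (iteration, objective, train_rmse, held_out_rmse)."""
    results = []
    for m in ITER_RE.finditer(log_text):
        results.append((int(m.group("iter")), float(m.group("obj")),
                        float(m.group("train")), float(m.group("held_out"))))
    return results


sample = (
    "Entering last iter with 4 total updates so far 39968 "
    "4.74525) Iter ALS_MATRIX (Alternating least squares) 4 Obj=338742, "
    "TRAIN RMSE=0.8624 VALIDATION RMSE=0.8605. "
    "Entering last iter with 5 total updates so far 49960 "
    "5.92972) Iter ALS_MATRIX (Alternating least squares) 5 Obj=307645, "
    "TRAIN RMSE=0.8221 VALIDATION RMSE=0.8562."
)
for row in parse_iterations(sample):
    print(row)
```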

    Running example 2 - MCMC matrix factorization

    ~/newgraphlab/graphlabapi/debug/apps/pmf$ ./PMF netflix 3 --ncpus=16 --D=30 --max_iter=30 --burn_in=20 --scheduler="round_robin(max_iterations=15,block_size=1)"
    setting regularization 1.000000e+01
    setting run mode 3
    INFO :pmf.cpp(main:1096): BPTF starting
    loading data file netflix
    Loading netflix train
    Creating 3298163 edges...
    .................loading data file netflixe
    Loading netflixe test
    Creating 545177 edges...
    ...BPTF for tensor (95526, 3561, 27):3298163. D=30
    nuAlpha=1, Walpha=1, mu=0, muT=1, nu=30, beta=1, W=1, WT=1 BURN_IN=20
    complete. Obj=2.30766e+07, TEST RMSE=3.7948.
    sampled alpha is 0.999809
    INFO :asynchronous_engine.hpp(run:56): Worker 0 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 1 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 2 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 3 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 4 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 5 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 6 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 7 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 8 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 9 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 10 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 11 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 12 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 14 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 15 started.
    INFO :asynchronous_engine.hpp(run:56): Worker 13 started.
    Entering last iter with 1
    11.9759) Iter BPTF 1 Obj=2.31406e+07, TRAIN RMSE=3.7285 TEST RMSE=3.1044.
    sampled alpha is 0.0719377
    Entering last iter with 2
    19.5264) Iter BPTF 2 Obj=5.4579e+06, TRAIN RMSE=1.7788 TEST RMSE=1.0372.
    sampled alpha is 0.316193
    Entering last iter with 3
    27.0199) Iter BPTF 3 Obj=1.94329e+06, TRAIN RMSE=1.0122 TEST RMSE=1.0110.
    sampled alpha is 0.97671
    Entering last iter with 4
    34.4465) Iter BPTF 4 Obj=1.85077e+06, TRAIN RMSE=0.9807 TEST RMSE=0.9986.
    sampled alpha is 1.03911
    Entering last iter with 5
    41.9269) Iter BPTF 5 Obj=1.79675e+06, TRAIN RMSE=0.9583 TEST RMSE=0.9901.
    sampled alpha is 1.08866
    Entering last iter with 6
    49.4382) Iter BPTF 6 Obj=1.77346e+06, TRAIN RMSE=0.9424 TEST RMSE=0.9800.
    sampled alpha is 1.12418
    Entering last iter with 7
    56.8816) Iter BPTF 7 Obj=1.75737e+06, TRAIN RMSE=0.9283 TEST RMSE=0.9701.
    sampled alpha is 1.15981
    Entering last iter with 8
    64.4001) Iter BPTF 8 Obj=1.75944e+06, TRAIN RMSE=0.9175 TEST RMSE=0.9622.
    sampled alpha is 1.18641
    Entering last iter with 9
    71.9716) Iter BPTF 9 Obj=1.75971e+06, TRAIN RMSE=0.9095 TEST RMSE=0.9575.
    sampled alpha is 1.20994
    Entering last iter with 10
    79.4099) Iter BPTF 10 Obj=1.7484e+06, TRAIN RMSE=0.8992 TEST RMSE=0.9541.
    sampled alpha is 1.23627
    Entering last iter with 11
    86.894) Iter BPTF 11 Obj=1.72904e+06, TRAIN RMSE=0.8860 TEST RMSE=0.9512.
    sampled alpha is 1.27441
    Entering last iter with 12
    94.3665) Iter BPTF 12 Obj=1.7215e+06, TRAIN RMSE=0.8775 TEST RMSE=0.9491.
    sampled alpha is 1.29865
    Entering last iter with 13
    101.821) Iter BPTF 13 Obj=1.72508e+06, TRAIN RMSE=0.8725 TEST RMSE=0.9491.
    sampled alpha is 1.31404
    Entering last iter with 14
    109.313) Iter BPTF 14 Obj=1.7315e+06, TRAIN RMSE=0.8691 TEST RMSE=0.9488.
    sampled alpha is 1.32381
    Entering last iter with 15
    116.799) Iter BPTF 15 Obj=1.74022e+06, TRAIN RMSE=0.8661 TEST RMSE=0.9501.
    sampled alpha is 1.33515
    Entering last iter with 16
    124.317) Iter BPTF 16 Obj=1.74469e+06, TRAIN RMSE=0.8626 TEST RMSE=0.9502.
    sampled alpha is 1.34253
    Entering last iter with 17
    131.859) Iter BPTF 17 Obj=1.74901e+06, TRAIN RMSE=0.8593 TEST RMSE=0.9512.
    sampled alpha is 1.3544
    Entering last iter with 18
    139.371) Iter BPTF 18 Obj=1.75623e+06, TRAIN RMSE=0.8563 TEST RMSE=0.9520.
    sampled alpha is 1.3633
    Entering last iter with 19
    146.946) Iter BPTF 19 Obj=1.76375e+06, TRAIN RMSE=0.8535 TEST RMSE=0.9513.
    Finished burn-in period. starting to aggregate samples
    sampled alpha is 1.37124
    Entering last iter with 20
    154.395) Iter BPTF 20 Obj=1.77025e+06, TRAIN RMSE=0.8511 TEST RMSE=0.9514.
    sampled alpha is 1.37925
    Entering last iter with 21
    161.914) Iter BPTF 21 Obj=1.77484e+06, TRAIN RMSE=0.8483 TEST RMSE=0.9515.
    sampled alpha is 1.39034
    Entering last iter with 22
    169.413) Iter BPTF 22 Obj=1.748e+06, TRAIN RMSE=0.8347 TEST RMSE=0.9332.
    sampled alpha is 1.43515
    Entering last iter with 23
    176.906) Iter BPTF 23 Obj=1.7427e+06, TRAIN RMSE=0.8281 TEST RMSE=0.9262.
    sampled alpha is 1.45834
    Entering last iter with 24
    184.414) Iter BPTF 24 Obj=1.73938e+06, TRAIN RMSE=0.8233 TEST RMSE=0.9224.
    sampled alpha is 1.47744
    Entering last iter with 25
    191.896) Iter BPTF 25 Obj=1.73633e+06, TRAIN RMSE=0.8191 TEST RMSE=0.9199.
    sampled alpha is 1.49088
    Entering last iter with 26
    199.381) Iter BPTF 26 Obj=1.7327e+06, TRAIN RMSE=0.8152 TEST RMSE=0.9178.
    sampled alpha is 1.50575
    Entering last iter with 27
    206.87) Iter BPTF 27 Obj=1.73199e+06, TRAIN RMSE=0.8116 TEST RMSE=0.9161.
    sampled alpha is 1.51825
    Entering last iter with 28
    214.36) Iter BPTF 28 Obj=1.7314e+06, TRAIN RMSE=0.8082 TEST RMSE=0.9148.
    sampled alpha is 1.53222
    Entering last iter with 29
    221.862) Iter BPTF 29 Obj=1.72802e+06, TRAIN RMSE=0.8050 TEST RMSE=0.9136.
    sampled alpha is 1.5427
    INFO :asynchronous_engine.hpp(run:66): Worker 7 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 2 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 8 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 5 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 15 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 14 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 4 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 1 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 6 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 3 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 11 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 12 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 10 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 9 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 13 finished.
    INFO :asynchronous_engine.hpp(run:66): Worker 0 finished.
    Final result. Obj=1.72124e+06, TEST RMSE= 0.9137.
    Finished in 225.948695

    Running example 3: Yahoo! KDD Cup 2011 - Track1

    For an explanation of how to use GraphLab with the Yahoo! KDD Cup data, including conversion of the data using Matlab or Python, see: my blog.

    Running example 4: BPTF (Bayesian monte carlo matrix factorization) using Twitter social graph

    This example was contributed by Timmy Wilson @ smarttypes.org. It contains a Twitter network of 68 followers, 11646 followees, 1 day and 15883 links. Download the input file here
    <29|0>bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ ./pmf smarttypes_pmf 1 --scheduler="round_robin(max_iterations=20,block_size=1)" --float=true
    INFO: pmf.cpp(main:1260): PMF/ALS/SVD++/SGD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(main:1262): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
    Setting run mode BPTF_MATRIX
    INFO: pmf.cpp(main:1309): BPTF_MATRIX starting
    loading data file smarttypes_pmf
    Loading smarttypes_pmf TRAINING
    Matrix size is: USERS 68 MOVIES 11646 TIME BINS 1
    Creating 15883 edges (observed ratings)...
    .loading data file smarttypes_pmfe
    Loading smarttypes_pmfe VALIDATION
    skipping file
    loading data file smarttypes_pmft
    Loading smarttypes_pmft TEST
    skipping file
    setting regularization weight to 1
    BPTF_MATRIX for matrix (68, 11646, 1):15883. D=20
    pU=1, pV=1, pT=1, muT=1, D=20
    nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
    complete. Obj=7576.43, TRAIN RMSE=0.9513 VALIDATION RMSE=nan.
    sampled alpha is 0.997129
    max iterations = 20
    step = 1
    max_iterations = 20
    INFO: asynchronous_engine.hpp(run:94): Worker 0 started.
    INFO: asynchronous_engine.hpp(run:94): Worker 1 started.
    Entering last iter with 1
    0.361552) Iter BPTF_MATRIX 1 Obj=4646.35, TRAIN RMSE=0.7270 VALIDATION RMSE=nan.
    sampled alpha is 1.90087
    Entering last iter with 2
    0.728271) Iter BPTF_MATRIX 2 Obj=1698.21, TRAIN RMSE=0.3103 VALIDATION RMSE=nan.
    sampled alpha is 10.4702
    Entering last iter with 3
    1.12834) Iter BPTF_MATRIX 3 Obj=1368.41, TRAIN RMSE=0.1506 VALIDATION RMSE=nan.
    sampled alpha is 44.3981
    Entering last iter with 4
    1.49151) Iter BPTF_MATRIX 4 Obj=1276.31, TRAIN RMSE=0.1245 VALIDATION RMSE=nan.
    sampled alpha is 63.932
    Entering last iter with 5
    1.89511) Iter BPTF_MATRIX 5 Obj=1203.64, TRAIN RMSE=0.0904 VALIDATION RMSE=nan.
    sampled alpha is 122.476
    Entering last iter with 6
    2.25427) Iter BPTF_MATRIX 6 Obj=1178.26, TRAIN RMSE=0.0744 VALIDATION RMSE=nan.
    sampled alpha is 180.563
    Entering last iter with 7
    2.65659) Iter BPTF_MATRIX 7 Obj=1170.38, TRAIN RMSE=0.0575 VALIDATION RMSE=nan.
    sampled alpha is 297.039
    Entering last iter with 8
    3.02014) Iter BPTF_MATRIX 8 Obj=1160.73, TRAIN RMSE=0.0477 VALIDATION RMSE=nan.
    sampled alpha is 419.463
    Entering last iter with 9
    3.42518) Iter BPTF_MATRIX 9 Obj=1162.77, TRAIN RMSE=0.0394 VALIDATION RMSE=nan.
    Finished burn-in period. starting to aggregate samples
    sampled alpha is 610.536
    Entering last iter with 10
    3.79515) Iter BPTF_MATRIX 10 Obj=1161.87, TRAIN RMSE=0.0341 VALIDATION RMSE=nan.
    sampled alpha is 810.82
    Entering last iter with 11
    4.19491) Iter BPTF_MATRIX 11 Obj=1469.61, TRAIN RMSE=0.1970 VALIDATION RMSE=nan.
    sampled alpha is 25.4017
    Entering last iter with 12
    4.56205) Iter BPTF_MATRIX 12 Obj=1484.45, TRAIN RMSE=0.2007 VALIDATION RMSE=nan.
    sampled alpha is 24.5661
    Entering last iter with 13
    4.96378) Iter BPTF_MATRIX 13 Obj=1230.12, TRAIN RMSE=0.0700 VALIDATION RMSE=nan.
    sampled alpha is 203.111
    Entering last iter with 14
    5.33124) Iter BPTF_MATRIX 14 Obj=1229.07, TRAIN RMSE=0.0718 VALIDATION RMSE=nan.
    sampled alpha is 193.54
    Entering last iter with 15
    5.72784) Iter BPTF_MATRIX 15 Obj=1209.51, TRAIN RMSE=0.0424 VALIDATION RMSE=nan.
    sampled alpha is 536.412
    Entering last iter with 16
    6.101) Iter BPTF_MATRIX 16 Obj=1214.21, TRAIN RMSE=0.0419 VALIDATION RMSE=nan.
    sampled alpha is 555.104
    Entering last iter with 17
    6.49673) Iter BPTF_MATRIX 17 Obj=1212.21, TRAIN RMSE=0.0310 VALIDATION RMSE=nan.
    sampled alpha is 1000.04
    Entering last iter with 18
    6.87056) Iter BPTF_MATRIX 18 Obj=1215.99, TRAIN RMSE=0.0307 VALIDATION RMSE=nan.
    sampled alpha is 987.797
    Entering last iter with 19
    7.2658) Iter BPTF_MATRIX 19 Obj=1217.74, TRAIN RMSE=0.0237 VALIDATION RMSE=nan.
    sampled alpha is 1596.85
    Entering last iter with 20
    7.64149) Iter BPTF_MATRIX 20 Obj=1224.86, TRAIN RMSE=0.0233 VALIDATION RMSE=nan.
    sampled alpha is 1677.19
    INFO: asynchronous_engine.hpp(run:102): Worker 1 finished.
    INFO: asynchronous_engine.hpp(run:102): Worker 0 finished.
    Final result. Obj=1222.59, TRAIN RMSE= 0.0155 VALIDATION RMSE= nan.
    Finished in 7.686977
    Performance counters are: 0) EDGE_TRAVERSAL, 0.735296
    Performance counters are: 1) BPTF_SAMPLE_STEP, 0.803732
    Performance counters are: 2) CALC_RMSE_Q, 0.005395
    Performance counters are: 6) CALC_OBJ, 0.028909
    Performance counters are: 7) BPTF_MVN_RNDEX, 4.1201
    Performance counters are: 8) BPTF_LEAST_SQUARES2, 1.15168
    === REPORT FOR core() ===
    [Numeric]
    ncpus: 2
    [Other]
    affinities: false
    compile_flags:
    engine: async
    scheduler: round_robin
    schedyield: true
    scope: edge
    === REPORT FOR engine() ===
    [Numeric]
    num_edges: 15883
    num_syncs: 0
    num_vertices: 11714
    updatecount: 234280
    [Timings]
    runtime: 7.6 s
    [Other]
    termination_reason: task depletion (natural)
    [Numeric]
    updatecount_vector: 234280 (count: 2, min: 117120, max: 117160, avg: 117140)
    updatecount_vector.values: 117120,117160,
    <30|0>bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf$

    Running example 5 - implicit rating via weighted-ALS

    This example shows how the GraphLab collaborative filtering library can handle implicit rating datasets. To understand the construction, it is recommended to read the paper: Rong Pan, Yunhong Zhou, Bin Cao, N. N. Liu, R. Lukose, M. Scholz, Qiang Yang. One-Class Collaborative Filtering. In IEEE International Conference on Data Mining (ICDM '08), 2008.
    ./pmf netflix 9 --scheduler="round_robin(max_iterations=10,block_size=1)" --zero=true --implicitratingtype=uniform --implicitratingpercentage=0.03 --implicitratingvalue=0 --implicitratingweight=0.5
    Starting program: /mnt/bigbrofs/usr6/bickson/newgraphlab/graphlabapi/debug/demoapps/pmf/pmf netflix 9 --scheduler="round_robin(max_iterations=10,block_size=1)" --zero=true --implicitratingtype=uniform --implicitratingpercentage=0.03 --implicitratingvalue=0 --implicitratingweight=0.5
    warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffa63fd000
    [Thread debugging using libthread_db enabled]
    [New Thread 47893315401712 (LWP 8946)]
    INFO: pmf.cpp(do_main:417): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(do_main:424): Program compiled with it++ Support
    Setting run mode Weighted alternating least squares
    INFO: pmf.cpp(start:269): Weighted alternating least squares starting
    loading data file netflix
    Loading netflix TRAINING
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
    Creating 3298163 edges (observed ratings)...
    .................INFO: implicit.hpp(add_implicit_edges:77): added 9881029 implicit edges, rating=0 weight=0.5 type=uniform
    loading data file netflixe
    Loading netflixe VALIDATION
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
    Creating 545177 edges (observed ratings)...
    ...loading data file netflixt
    Loading netflixt TEST
    skipping file
    setting regularization weight to 1
    Weighted alternating least squares for matrix (95526, 3561, 27):13179192. D=20
    pU=1, pV=1, pT=1, D=20
    complete. Objective=1.99427e+08, TRAIN RMSE=5.5012 VALIDATION RMSE=11.6063.
    [New Thread 1199630672 (LWP 8976)]
    INFO: pmf.cpp(run_graphlab:219): starting with scheduler: round_robin
    max iterations = 10
    step = 1
    max_iterations = 10
    INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
    [New Thread 1216416080 (LWP 8978)]
    INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
    Entering last iter with 1
    28.4887) Iter Weighted alternating least squares 1 Obj=1.84392e+08, TRAIN RMSE=5.2859 VALIDATION RMSE=8.2575.
    Entering last iter with 2
    56.6783) Iter Weighted alternating least squares 2 Obj=5.09993e+07, TRAIN RMSE=2.7740 VALIDATION RMSE=5.3221.
    Entering last iter with 3
    84.1871) Iter Weighted alternating least squares 3 Obj=3.59321e+07, TRAIN RMSE=2.3284 VALIDATION RMSE=4.9189.
    Entering last iter with 4
    113.502) Iter Weighted alternating least squares 4 Obj=3.10098e+07, TRAIN RMSE=2.1633 VALIDATION RMSE=4.7755.
    The command line flags related to implicit ratings are:
    --implicitratingtype=user or --implicitratingtype=uniform - adds implicit edges in proportion to each user's current number of edges, or uniformly to every user.
    --implicitratingpercentage - a number between 0 and 1 that determines the percentage of edges to add to the sparse model. 0 means none, while 1 means a fully dense model.
    --implicitratingvalue - the value of the added ratings. The default is zero, but you can change it.
    --implicitratingweight - the weight of the implicit rating (or time). The default is one.
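To illustrate what the uniform setting means, here is a toy Python sketch that adds implicit edges to a small rating matrix. It is not the GraphLab implementation: we assume that each unobserved user/item cell is added independently with probability equal to the percentage, which matches the two endpoints described above (0 adds nothing, 1 gives a fully dense model) but may differ from the real sampling scheme. The function name and the (user, item, value, weight) tuple layout are ours.

```python
import random


def uniform_implicit_edges(observed, num_users, num_items, percentage,
                           value=0.0, weight=1.0, seed=0):
    """Sample implicit edges uniformly over the unobserved cells.

    observed: set (or dict) of (user, item) pairs that already have a rating.
    Returns a list of (user, item, value, weight) tuples.
    """
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0 and 1")
    rng = random.Random(seed)
    edges = []
    for user in range(num_users):
        for item in range(num_items):
            if (user, item) in observed:
                continue  # never overwrite an observed rating
            if rng.random() < percentage:
                edges.append((user, item, value, weight))
    return edges


observed = {(0, 0), (1, 2)}
dense = uniform_implicit_edges(observed, 3, 4, 1.0, value=0.0, weight=0.5)
print(len(dense))
```

Scanning every cell is only practical for small examples; a dataset of Netflix size would need sampling without enumerating the full matrix.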

    Running example 6: Netflix data with sparse movie factor matrix

    In this example we show how to factorize the Netflix data with the requirement that 90% of the movie factor matrix be zeros. You can then use the sparse matrices to cluster similar users or movies into related groups.
    bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf $ ./pmf netflix 12 --scheduler="round_robin(max_iterations=10,block_size=1)" --float=false --ncpus=8 --desired_factor_sparsity=0.9 --lambda=0.06
    INFO: pmf.cpp(main:565): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(main:567): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
    WARNING: pmf.cpp(main:570): Code compiled with GL_NO_MCMC flag - this mode does not support MCMC methods.
    Setting run mode Alternating least squares with sparse movie factor matrix
    INFO: pmf.cpp(start:370): Alternating least squares with sparse movie factor matrix starting
    loading data file netflix
    Loading netflix TRAINING
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
    Creating 3298163 edges (observed ratings)...
    .................loading data file netflixe
    Loading netflixe VALIDATION
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
    Creating 545177 edges (observed ratings)...
    ...loading data file netflixt
    Loading netflixt TEST
    skipping file
    setting regularization weight to 0.06
    Alternating least squares with sparse movie factor matrix for matrix (95526, 3561, 27):3298163. D=20
    pU=0.06, pV=0.06, pT=1, D=20
    Current sparsity : 0 %
    complete. Objective=1.30614e+07, TRAIN RMSE=2.8139 VALIDATION RMSE=2.8790.
    max iterations = 10
    step = 1
    max_iterations = 10
    INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 2 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 3 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 4 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 6 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 7 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 5 started.
    Entering last iter with 1
    Current sparsity : 0.9
    2.61367) Iter Alternating least squares with sparse movie factor matrix 1 Obj=8.39139e+06, TRAIN RMSE=2.2338 VALIDATION RMSE=2.4251.
    Entering last iter with 2
    Current sparsity : 0.95 %
    5.25192) Iter Alternating least squares with sparse movie factor matrix 2 Obj=2.52153e+06, TRAIN RMSE=1.2152 VALIDATION RMSE=1.6419.
    Entering last iter with 3
    Current sparsity : 0.9
    7.88379) Iter Alternating least squares with sparse movie factor matrix 3 Obj=2.36985e+06, TRAIN RMSE=1.1787 VALIDATION RMSE=1.3749.
    Entering last iter with 4
    Current sparsity : 0.9
    10.5112) Iter Alternating least squares with sparse movie factor matrix 4 Obj=2.57171e+06, TRAIN RMSE=1.2280 VALIDATION RMSE=1.3589.
    Entering last iter with 5
    Current sparsity : 0.9
    13.0986) Iter Alternating least squares with sparse movie factor matrix 5 Obj=2.76916e+06, TRAIN RMSE=1.2758 VALIDATION RMSE=1.3188.
    Entering last iter with 6
    Current sparsity : 0.9 %
    15.7324) Iter Alternating least squares with sparse movie factor matrix 6 Obj=2.74914e+06, TRAIN RMSE=1.2721 VALIDATION RMSE=1.2410.
    Entering last iter with 7
    Current sparsity : 0.9
    18.3847) Iter Alternating least squares with sparse movie factor matrix 7 Obj=2.53998e+06, TRAIN RMSE=1.2239 VALIDATION RMSE=1.0778.
    Entering last iter with 8
    Current sparsity : 0.9
    20.9803) Iter Alternating least squares with sparse movie factor matrix 8 Obj=1.84584e+06, TRAIN RMSE=1.0436 VALIDATION RMSE=0.9723.
    Entering last iter with 9
    Current sparsity : 0.9
    23.6121) Iter Alternating least squares with sparse movie factor matrix 9 Obj=1.48064e+06, TRAIN RMSE=0.9341 VALIDATION RMSE=0.9608.
    Entering last iter with 10
    Current sparsity : 0.9
    26.1979) Iter Alternating least squares with sparse movie factor matrix 10 Obj=1.43894e+06, TRAIN RMSE=0.9217 VALIDATION RMSE=0.9596.
    INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.
    INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.
    Current sparsity : 0.9
    Final result. Obj=1.43894e+06, TRAIN RMSE= 0.9217 VALIDATION RMSE= 0.9596.
    Finished in 26.611790 seconds
    Performance counters are: 0) EDGE_TRAVERSAL, 49.7254
    Performance counters are: 2) CALC_RMSE_Q, 0.001046
    Performance counters are: 3) ALS_LEAST_SQUARES, 81.24
    Performance counters are: 6) CALC_OBJ, 0.59852
    === REPORT FOR core() ===
    [Numeric]
    ncpus: 8
    [Other]
    affinities: false
    compile_flags:
    engine: async
    scheduler: round_robin
    schedyield: true
    scope: edge
    === REPORT FOR engine() ===
    [Numeric]
    num_edges: 3.29816e+06
    num_syncs: 0
    num_vertices: 99087
    updatecount: 990870
    [Timings]
    runtime: 26.2 s
    [Other]
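
    The "Current sparsity" figure in the log is the fraction of zero entries in the factor matrix. The Python sketch below shows one simple way to measure that fraction and to reach a target such as 0.9 by keeping only the largest-magnitude entries. This global thresholding is an assumption chosen for illustration; the page does not say how pmf enforces --desired_factor_sparsity, and both function names are hypothetical.

```python
def sparsity(matrix):
    """Fraction of entries in a list-of-lists matrix that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0.0)
    return zeros / total


def sparsify(matrix, desired_sparsity):
    """Zero all but the largest-magnitude entries.

    About desired_sparsity of the entries end up zero; ties at the
    threshold are all kept, so the result can be slightly denser.
    """
    magnitudes = sorted((abs(x) for row in matrix for x in row), reverse=True)
    keep = len(magnitudes) - int(round(desired_sparsity * len(magnitudes)))
    if keep <= 0:
        return [[0.0 for _ in row] for row in matrix]
    threshold = magnitudes[keep - 1]
    return [[x if abs(x) >= threshold else 0.0 for x in row] for row in matrix]
```

    For clustering, the nonzero pattern of each movie's row can then serve as a rough group signature: movies that keep the same factor indices fall into the same group.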

    Running example 7: loading from matrix market sparse matrix format

    1) Download the input files smallnetflix_mm and smallnetflix_mme. These are text input files with the following format:
    %%MatrixMarket matrix coordinate real general
    % Generated 28-Aug-2011
    95526 3561 3298163
    13 1 1
    83 1 2
    127 1 2
    136 1 5
    137 1 4
    1
    Here 95526 is the number of users, 3561 is the number of movies, and 3298163 is the number of ratings. Each row holds one rating: in the first row, user 13 rated movie 1 and gave it a rating of 1.
    2) Run alternating least squares. Don't forget the switch --matrixmarket=true.
    <33|1>bickson@biggerbro:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf smallnetflix_mm 0 --matrixmarket=true --scheduler="round_robin(max_iterations=10,block_size=1)"
    INFO: pmf.cpp(do_main:465): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(do_main:467): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
    Setting run mode ALS_MATRIX (Alternating least squares)
    INFO: pmf.cpp(start:308): ALS_MATRIX (Alternating least squares) starting
    loading data file smallnetflix_mm
    Loading Matrix Market file smallnetflix_mm TRAINING
    Loading smallnetflix_mm TRAINING
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
    loading data file smallnetflix_mme
    Loading Matrix Market file smallnetflix_mme VALIDATION
    Loading smallnetflix_mme VALIDATION
    Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
    loading data file smallnetflix_mmt
    Loading Matrix Market file smallnetflix_mmt TEST
    Loading smallnetflix_mmt TEST
    skipping file
    setting regularization weight to 1
    ALS_MATRIX (Alternating least squares) for matrix (95526, 3561, 1):3298163. D=20
    pU=1, pV=1, pT=1, D=20
    complete. Objective=2.26985e+07, TRAIN RMSE=3.7098 VALIDATION RMSE=3.7762.
    max iterations = 10
    step = 1
    max_iterations = 10
    INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
    INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
    Entering last iter with 1
    4.52759) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.13107e+07, TRAIN RMSE=3.5919 VALIDATION RMSE=2.5129.
    Entering last iter with 2
    9.05031) Iter ALS_MATRIX (Alternating least squares) 2 Obj=2.76594e+06, TRAIN RMSE=1.2658 VALIDATION RMSE=1.4300.
    ....
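
    To check an input file before handing it to pmf, the coordinate format used in this example can be read with a few lines of Python. This is a minimal sketch that assumes a real-valued coordinate file with one "user movie rating" triple per row; the function name read_matrix_market is hypothetical and is not part of GraphLab.

```python
def read_matrix_market(lines):
    """Parse coordinate-format Matrix Market text.

    Returns (rows, cols, declared_nnz, entries), where entries is a list
    of (user, movie, rating) tuples with the 1-based ids found in the file.
    """
    it = iter(lines)
    banner = next(it, "")
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket banner")
    # Skip comment lines (starting with '%') and blank lines up to the size line.
    for line in it:
        line = line.strip()
        if line and not line.startswith("%"):
            break
    else:
        raise ValueError("missing size line")
    rows, cols, declared_nnz = (int(tok) for tok in line.split())
    entries = []
    for line in it:
        tokens = line.split()
        if not tokens:
            continue
        entries.append((int(tokens[0]), int(tokens[1]), float(tokens[2])))
    return rows, cols, declared_nnz, entries


sample = """%%MatrixMarket matrix coordinate real general
% Generated 28-Aug-2011
95526 3561 3298163
13 1 1
83 1 2
127 1 2
"""
rows, cols, declared_nnz, entries = read_matrix_market(sample.splitlines())
```

    For a complete file, comparing len(entries) against declared_nnz is a quick sanity check; the truncated sample above contains only three of the declared 3298163 ratings.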

    Running example 8: SVD++ with MovieLens data

    You can download the MovieLens data (the training file is movielens_mm, the validation file is movielens_mme) from here.
    ./pmf movielens_mm 5 --scheduler="round_robin(max_iterations=10,block_size=1)" --float=true --ncpus=8 --maxval=5 --minval=1 --matrixmarket=true
    INFO: pmf.cpp(do_main:434): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(do_main:441): Program compiled with it++ Support
    Setting run mode SVD++
    INFO: pmf.cpp(start:285): SVD++ starting
    loading data file movielens_mm
    Loading Matrix Market file movielens_mm TRAINING
    Loading movielens_mm TRAINING
    Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
    loading data file movielens_mme
    Loading Matrix Market file movielens_mme VALIDATION
    Loading movielens_mme VALIDATION
    Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
    loading data file movielens_mmt
    Loading Matrix Market file movielens_mmt TEST
    Loading movielens_mmt TEST
    skipping file
    ...
    SVD++ for matrix (6040, 3952, 1):900000. D=20
    SVD++ 20 factors
    complete. Objective=2.88718e-305, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
    INFO: pmf.cpp(run_graphlab:232): starting with scheduler: round_robin
    max iterations = 10
    step = 1
    max_iterations = 10
    ...
    Entering last iter with 1
    0.662835) Iter SVD 1, TRAIN RMSE=1.7591 VALIDATION RMSE=1.6971.
    Entering last iter with 2
    1.08171) Iter SVD 2, TRAIN RMSE=1.6513 VALIDATION RMSE=1.5921.
    Entering last iter with 3
    1.46447) Iter SVD 3, TRAIN RMSE=1.5506 VALIDATION RMSE=1.5136.
    Entering last iter with 4
    1.88655) Iter SVD 4, TRAIN RMSE=1.4606 VALIDATION RMSE=1.4388.
    Entering last iter with 5
    2.34319) Iter SVD 5, TRAIN RMSE=1.3883 VALIDATION RMSE=1.3739.
    Entering last iter with 6
    2.75348) Iter SVD 6, TRAIN RMSE=1.3335 VALIDATION RMSE=1.3205.
    Entering last iter with 7
    3.65773) Iter SVD 7, TRAIN RMSE=1.2506 VALIDATION RMSE=1.3780.
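
    In the run above the validation RMSE rises at iteration 7 (from 1.3205 to 1.3780) while the training RMSE keeps falling, so it helps to track both curves. The Python sketch below pulls the per-iteration RMSE values out of captured pmf output and picks the iteration with the lowest validation RMSE. The regular expression is an assumption based only on the "Iter" lines shown on this page, and the function names are hypothetical.

```python
import re

# Matches lines such as
#   "0.662835) Iter SVD 1, TRAIN RMSE=1.7591 VALIDATION RMSE=1.6971."
#   "4.52759) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.13107e+07, TRAIN RMSE=3.5919 VALIDATION RMSE=2.5129."
ITER_RE = re.compile(
    r"Iter (.+?) (\d+),? (?:Obj=\S+ )?"
    r"TRAIN RMSE=\s*(\d+\.\d+) VALIDATION RMSE=\s*(\d+\.\d+)"
)


def parse_rmse(log_text):
    """Return a list of (iteration, train_rmse, validation_rmse) tuples."""
    return [(int(m.group(2)), float(m.group(3)), float(m.group(4)))
            for m in ITER_RE.finditer(log_text)]


def best_iteration(history):
    """Return the tuple with the lowest validation RMSE."""
    return min(history, key=lambda row: row[2])
```

    A typical use is to redirect the program output to a file, read it back as one string, and call best_iteration(parse_rmse(text)) to decide where to stop.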

    Other examples

    Further examples are found in the datasets and benchmark page.

    Debugging execution

    To debug your dataset features, run with the --stats=true command line option.
    For example: ./pmf netflix 0 --stats=true.

    You can also use the --debug=true flag to print debug traces.

    Acknowledgements

    As the project grows, so does the list of people we should thank:
  • Liang Xiong, CMU webpage for providing the Matlab code of BPTF, numerous discussions and infinite support!! Thanks!!
  • Timmy Wilson, Smarttypes.org for providing twitter network snapshot example, and Python scripts for reading the output.
  • Sanmi Koyejo, from the University of Texas at Austin, for providing Python scripts for preparing the inputs.
  • Dan Brickley, from VU University Amsterdam, for helping debug the installation and prepare the input in Octave.
  • Nicholas Ampazis, University of the Aegean, for providing his SVD++ source code.
  • Yehuda Koren, Yahoo! Research, for providing his SVD++ source code implementation.
  • Marinka Zitnik, University of Ljubljana, Slovenia, for helping debug ALS and suggesting NMF algorithms to implement.
  • Joel Welling from Pittsburgh Supercomputing Center, for optimizing GraphLab on the BlackLight supercomputer and simplifying the installation procedure.
  • Sagar Soni from Gujarat Technological University and Hasmukh Goswami College of Engineering for helping test the code.
  • Young Cha, UCLA for testing the code.
  • Mohit Singh for helping improve documentation.
  • Nicholas Kolegraff for testing our examples.
  • Theo Throuillon, Ecole Nationale Superieure d'Informatique et de Mathematiques Appliquees de Grenoble for debugging NMF.
  • Qiang Yan, Chinese Academy of Sciences, for providing the time-svd++, bias-SVD and RBM code that the GraphLab version is based on.