This webpage explains how to use the GraphLab collaborative filtering library, which implements multiple matrix decomposition algorithms.
See description in the following papers:
Probabilistic matrix/tensor factorization:
A) Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G. Carbonell,
Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proceedings of SIAM Data Mining, 2010.
html (source code is also available).
B) Salakhutdinov and Mnih, Bayesian Probabilistic Matrix Factorization using Markov
Chain Monte Carlo. in International Conference on Machine Learning, 2008.
pdf, project website. Our code implements matrix factorization as a special case
of tensor factorization.
C) Alternating least squares: Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan.
Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management.
Shanghai, China pp. 337-348, 2008. pdf
D) SVD++ algorithm: Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." In Proceedings of the 14th ACM SIGKDD
international conference on Knowledge discovery and data mining, 426-434. ACM, 2008. http://portal.acm.org/citation.cfm?id=1401890.1401944
E) SGD (stochastic gradient descent) algorithm:
Matrix Factorization Techniques for Recommender Systems
Yehuda Koren, Robert Bell, Chris Volinsky
In IEEE Computer, Vol. 42, No. 8. (07 August 2009), pp. 30-37.
F) Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
G) For Lanczos algorithm (SVD) see: wikipedia.
H) For NMF (non-negative matrix factorization) see: Lee, D.D., and Seung, H.S., (2001), 'Algorithms for Non-negative Matrix
Factorization', Adv. Neural Info. Proc. Syst. 13, 556-562.
I) For Weighted-Alternating least squares: Collaborative Filtering for Implicit Feedback Datasets
Hu, Y.; Koren, Y.; Volinsky, C. IEEE International Conference on Data Mining (ICDM 2008), IEEE (2008).
J) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 502-511.
K) For sparse factor matrices see:
Xi Chen, Yanjun Qi, Bing Bai, Qihang Lin and Jaime Carbonell. Sparse Latent Semantic Analysis. In SIAM International Conference on Data Mining (SDM), 2011.
D. Needell, J. A. Tropp
CoSaMP: Iterative signal recovery from incomplete and inaccurate samples
Applied and Computational Harmonic Analysis, Vol. 26, No. 3. (17 Apr 2008), pp. 301-321.
L) For SVD see Wikipedia
M) For time-SVD++, see
Yehuda Koren. 2009. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09). ACM, New York, NY, USA, 447-456. DOI=10.1145/1557019.1557072
N) For bias-SVD
Y. Koren. Factorization Meets the Neighborhood: a Multifaceted
Collaborative Filtering Model. Equation (5), pdf.
O) For RBM:
G. Hinton. A Practical Guide to Training
Restricted Boltzmann Machines. University of Toronto Tech report UTML TR 2010-003
pdf.
5 April, 2011: Updated instructions on how to install and run GraphLab matrix factorization with the Yahoo! KDD Cup data can be found here.
Target
This code provides a highly efficient implementation of matrix/tensor factorization on multicore machines. So far it has been tested with a tensor
of size 16M x 16M x 100 with 1,000,000,000 non-zero entries, using up to 32 cores.
Requirements
- it++ or Eigen should be installed. it++ is a C++ wrapper for the efficient BLAS/LAPACK linear algebra packages.
Detailed itpp installation procedure.
- Alternatively, the Eigen linear algebra package is also supported. To configure GraphLab for use with Eigen, pass the --eigen command line flag to ./configure.
- GraphLab should be installed. Follow installation instructions here.
- Useful tip: it is advised to join our GraphLab users Google group. We post updates, tips and installation instructions there.
- Memory requirement: roughly 8GB of memory is needed per 150,000,000 non-zero ratings, so a machine with 64GB of memory can easily handle
1,000,000,000 non-zero ratings.
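The memory rule of thumb can be turned into a quick back-of-the-envelope estimate. The sketch below is only a linear extrapolation of the 8GB per 150,000,000 ratings figure; the function name is made up for illustration, and actual usage will likely also depend on the feature length D and the chosen algorithm.

```python
# Rule of thumb from the requirements list: ~8 GB per 150,000,000 non-zero ratings.
GB_PER_BLOCK = 8.0
RATINGS_PER_BLOCK = 150_000_000


def estimate_memory_gb(num_ratings):
    """Linearly extrapolate the rule of thumb (hypothetical helper, not part of GraphLab)."""
    return num_ratings / RATINGS_PER_BLOCK * GB_PER_BLOCK


if __name__ == "__main__":
    # One billion ratings comes out below 64 GB, matching the claim in the text.
    print(round(estimate_memory_gb(1_000_000_000), 1))
```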
Program input
The GraphLab collaborative filtering library has three input files: training, validation and test. Only the training input is mandatory; the validation and test inputs are optional.
The training file is used for training the model on observed user-item ratings,
the validation file is used to assess the quality of the trained model, and the test file is used to predict unobserved ratings.
The naming convention: if foobar is the training file, then the validation file should be named foobare and the test file foobart.
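The file naming rule is easy to get wrong when scripting, so here is a minimal Python sketch of it. The helper names are hypothetical; only the "e" and "t" suffixes come from the text above.

```python
import os


def companion_files(training_path):
    """Return (training, validation, test) names: validation appends 'e', test appends 't'."""
    return training_path, training_path + "e", training_path + "t"


def existing_inputs(training_path):
    """List which of the three inputs are present on disk (only training is mandatory)."""
    return [p for p in companion_files(training_path) if os.path.exists(p)]


if __name__ == "__main__":
    print(companion_files("foobar"))
```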
The input is a sparse tensor (or matrix) prepared in one of the following ways:
- Using a text file in Matrix Market sparse matrix format. This is the recommended input option.
- Using Octave/matlab.
- Using Python.
- Mahout SVD sequence files (SequentialAccessSparseVector format)
Matrix Market input files
GraphLab supports sparse Matrix Market
input files. When you use this format, remember to pass the flag --matrixmarket=true when running GraphLab.
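For readers who have not written a Matrix Market file before, the following Python sketch writes a small ratings matrix in the standard coordinate format (header line, then a "rows cols nonzeros" line, then one "row col value" entry per line with 1-based indices). The function name and the toy data are illustrative assumptions; whether GraphLab expects any extra columns (for example time or weight) is not covered here.

```python
def write_matrix_market(path, num_users, num_items, ratings):
    """Write (user, item, rating) triples with 1-based ids as a Matrix Market coordinate file."""
    with open(path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        # Size line: number of rows, number of columns, number of stored entries.
        f.write(f"{num_users} {num_items} {len(ratings)}\n")
        for user, item, rating in ratings:
            f.write(f"{user} {item} {rating}\n")


if __name__ == "__main__":
    # Toy data: 3 users, 4 items, 2 observed ratings (hypothetical file name).
    write_matrix_market("toy_mm", 3, 4, [(1, 2, 5.0), (3, 4, 1.0)])
```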
Matlab/Octave input
There are several ways to prepare the program input. Perhaps the easiest is using the save_c_gl_mat.m script. The input to the script is a matrix of size NumOfRatings X 4.
Each row has the following format: [ from user] [ to movie ] [ time ] [ rating ].
Note: when weighted alternating least squares is used, [time] can instead hold the [weight/confidence in ratings].
Note: as in Matlab, user ids and movie ids should start from 1, and time bins should also start from 1.
Another useful script is prepare_dataset.m
It takes as input a text file of the row format
[user#] [movie#] [rating]
or
[user#] [movie#] [time/weight] [ rating]
and converts it into GraphLab PMF format.
For example, the following Matlab code generates a random 5x12 matrix and
creates two GraphLab PMF inputs (training and validation) by splitting the data
randomly, 90% for training and the rest for validation:
A=rand(5,12);
[a,b,c]=find(A);
A = [a b c];
save -ascii tempA.txt A;
prepare_dataset('tempA.txt', 'svd5x12', .90);
The results are the files svd5x12 (training) and svd5x12e (validation).
Note: when generating the validation file, make sure it has the same number of users and movies as the training file.
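The point about matching dimensions can be illustrated with a Python analogue of the 90/10 split. This is a sketch, not a port of prepare_dataset.m: the function names are hypothetical, and the idea is simply to compute the matrix dimensions from the full data set before splitting, then use those same dimensions for both output files.

```python
import random


def split_ratings(ratings, train_fraction=0.9, seed=0):
    """Randomly split (user, movie, rating) triples into training and validation lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def matrix_dims(ratings):
    """Number of users and movies, taken from the largest 1-based ids in the FULL data set."""
    return max(r[0] for r in ratings), max(r[1] for r in ratings)


if __name__ == "__main__":
    # A dense 5x12 toy matrix, as in the Matlab example.
    full = [(u, m, 1.0) for u in range(1, 6) for m in range(1, 13)]
    dims = matrix_dims(full)           # compute once, before splitting
    train, validation = split_ratings(full)
    print(dims, len(train), len(validation))
```

Using `dims` from the full data for both files avoids the case where the validation split happens to miss the highest user or movie id.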
A third useful Matlab script is convert2seq.m
which takes an adjacency list with N arbitrary integer node ids and translates them into
consecutive node ids between 0 and N-1. For example, assume our graph is a network of connections
between IP addresses, where we have 100 IP addresses and 200 connections between them. Since GraphLab
uses a continuous range of node ids, it is desirable to number the nodes 0-99 and keep a mapping
from each new node id back to its original IP address.
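The same renumbering idea can be sketched in a few lines of Python. This is not the convert2seq.m script itself; the function name and the toy IP addresses are assumptions made for illustration.

```python
def to_consecutive_ids(edges):
    """Map arbitrary node labels to 0..N-1 in order of first appearance.

    Returns the remapped edge list and a reverse map from new id to original label.
    """
    mapping = {}
    remapped = []
    for src, dst in edges:
        for node in (src, dst):
            if node not in mapping:
                mapping[node] = len(mapping)  # next unused consecutive id
        remapped.append((mapping[src], mapping[dst]))
    reverse = {new_id: label for label, new_id in mapping.items()}
    return remapped, reverse


if __name__ == "__main__":
    links = [("10.0.0.7", "10.0.0.9"), ("10.0.0.9", "192.168.1.1")]
    new_links, id_to_ip = to_consecutive_ids(links)
    print(new_links)
    print(id_to_ip)
```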
Python input
If you prefer to use Python for preparing the inputs, an example Python script is found here.
Mahout/Hadoop SequentialAccessSparseVector input files
Use the following instructions for converting Mahout's SVD input files to GraphLab's.
Program output
The output of the program is three matrices, U, V and T, of size dim1 X D, dim2 X D and dim3 X D.
The output is generated to a file named [inputfile].out
There are several supported formats for the program output.
- Matrix Market sparse matrix output (recommended!)
- Matlab output
- Python (binary) output
Matrix market output
When using the flag --matrixmarket=true, the output matrices U and V are written to two sparse Matrix Market output files.
You can load the files in Matlab/Octave using the script mmread.m.
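If Matlab/Octave is not at hand, a Matrix Market file can also be read with plain Python. The reader below is a minimal sketch that assumes the standard real coordinate layout (header, optional % comment lines, a size line, then 1-based "row col value" entries); it has not been checked against actual GraphLab output files, and the function name is hypothetical.

```python
def read_matrix_market(path):
    """Read a real coordinate Matrix Market file into a dense list of rows."""
    with open(path) as f:
        header = f.readline()
        if not header.startswith("%%MatrixMarket") or "coordinate" not in header:
            raise ValueError("expected a Matrix Market coordinate file")
        line = f.readline()
        while line.startswith("%"):      # skip comment lines
            line = f.readline()
        rows, cols, nnz = (int(x) for x in line.split())
        dense = [[0.0] * cols for _ in range(rows)]
        for _ in range(nnz):
            i, j, value = f.readline().split()
            dense[int(i) - 1][int(j) - 1] = float(value)  # file ids are 1-based
    return dense
```

For example, `U = read_matrix_market("chapters.mm.U")` would give one row of D factors per user, assuming that file exists and follows the layout above.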
Matlab/Octave output
You can read the program output in Matlab/Octave using the command:
>> itload('netflix.out')
>> whos
Name Size Bytes Class Attributes
Time 27x30 6480 double
User 95526x30 22926240 double
Movie 3561x30 854640 double
Note: itload must first be downloaded here and saved in your release/demoapps/pmf/ working folder.
Python (binary) output
Here is a script for reading the output in Python, parse_graphlab_pmf.py, thanks to Timmy Wilson, Smarttypes.org. Use the flag --binaryoutput=true to select this format.
Computing predictions
There are three possible ways to compute predictions of user/item pairs:
- Using a test input file you can get predictions for a predefined list of user/item pairs.
- Using the glcluster program for finding the top K predictions.
- Using Matlab/Octave.
Here are some more details:
Using test input file
Assume your training input file is mydataset; your test input file should then be called mydatasett (namely, a "t" is
appended to the end of the filename). The test file lists the user/item pairs to compute predictions for. It has the
exact same format as the training and validation files. (The rating value given in the test file is simply ignored.)
At the end of the run you will get an output file with the scalar prediction computed for each user/item pair.
Using glcluster program
For computing the top K predictions you can use the glcluster program as follows.
Assume you run pmf on a file called "chapters.mm". The result is two files, chapters.mm.U and chapters.mm.V. To compute the recommended ratings, run:
ln -s chapters.U chapters
ln -s chapters.V chapterse
./glcluster chapters 8 3 0 --matrixmarket=true --training_ref=chapters.mm --ncpus=8
The output of this command is:
1) chapters.scalar-ratings.mtx
2) chapters.recommended-items.mtx
The file recommended-items includes the ids of the recommended items (starting from 0, not from 1!) for each user. Each user is in a new row.
The file scalar-ratings lists the computed scalar rating for each of the recommended items.
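Conceptually, a top K recommendation amounts to scoring every item for a user and keeping the K best. The Python sketch below illustrates this under the assumption that a predicted rating is the dot product of a user factor row and an item factor row; it is not the glcluster implementation, and the function name, the exclusion of already rated items, and the toy factors are all assumptions.

```python
import heapq


def top_k_items(user_vec, item_factors, k, exclude=()):
    """Return the k best (item_id, score) pairs for one user, with 0-based item ids."""
    excluded = set(exclude)
    scored = (
        (sum(u * v for u, v in zip(user_vec, item_vec)), item_id)
        for item_id, item_vec in enumerate(item_factors)
        if item_id not in excluded
    )
    best = heapq.nlargest(k, scored)      # highest scores first
    return [(item_id, score) for score, item_id in best]


if __name__ == "__main__":
    user = [1.0, 0.0]
    items = [[0.5, 0.0], [2.0, 0.0], [1.0, 0.0]]
    print(top_k_items(user, items, k=2))
```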
Using Matlab
For computing prediction of entry $x_{i,j,k}$ in the tensor, you should simply compute the product in Matlab:
>> sum(User(i,:) .* Movie(j,:) .* Time(k,:))
Alternatively, if this is a matrix, you can compute:
>> sum(User(i,:) .* Movie(j,:))
For example, if you want to predict user i=3 to movie j=8 you can do:
>> sum(User(3,:) .* Movie(8,:))
And you will get a scalar result that predicts the rating based on the linear model.
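The same computation can be written in Python once the factor matrices are loaded as lists of rows. This is a small sketch with a hypothetical function name; note that Python indexes from 0, so Matlab's User(3,:) corresponds to user[2].

```python
def predict(user_vec, movie_vec, time_vec=None):
    """Linear-model prediction: sum of elementwise products of the factor rows."""
    if time_vec is None:
        # Matrix case: sum(User(i,:) .* Movie(j,:))
        return sum(u * m for u, m in zip(user_vec, movie_vec))
    # Tensor case: sum(User(i,:) .* Movie(j,:) .* Time(k,:))
    return sum(u * m * t for u, m, t in zip(user_vec, movie_vec, time_vec))


if __name__ == "__main__":
    print(predict([1.0, 2.0], [3.0, 4.0]))               # matrix case
    print(predict([1.0, 2.0], [3.0, 4.0], [1.0, 0.5]))   # tensor case
```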
Unit testing
After installation, it is a good idea to run the unit tests first.
cd release/tests
./runtests.sh 1
You will see a report of all unit tests and their results. In case of any failure, please email the resulting output file stdout.log to danny.bickson@gmail.com .
Running PMF
Command line options
After preparing the GraphLab input file using the save_c_gl_mat.m script, you should run:
./PMF [input file] [run mode] --scheduler="round_robin(max_iterations=XXX,block_size=1)"
// where XXX is the number of desired iterations
The following are the supported run modes:
0 = Matrix factorization using alternating least squares
1 = Matrix factorization using MCMC procedure
2 = Tensor factorization using MCMC procedure, single edge
exist between user and movies
3 = Tensor factorization, using MCMC procedure with support
for multiple edges between user and movies in different times
4 = Tensor factorization using alternating least squares
5 = SVD++ - Koren's SVD++ algorithm
6 = SGD - stochastic gradient descent
7 = SVD - Lanczos algorithm
8 = NMF - non-negative matrix factorization algo of Lee and Seung
9 = Weighted alternating least squares
10 = Alternating least squares with sparse user factor matrix
11 = Alternating least squares with sparse user and movie factor matrices
12 = Alternating least squares with sparse movie factor matrix
13 = SVD - singular value decomposition (via double Lanczos method)
14 = time-SVD++ - Koren's method
15 = bias-SGD, stochastic gradient descent with user and item bias
16 = Restricted Boltzmann Machines (RBM)
The following are optional parameters (short list). You can view the full list using the command "./pmf --help".
--matrixmarket=true - for matrix market format
--debug=true - display debug info
--ncpus=XX - run with XX cpus
--D=XX - feature vectors length (reasonable values 5-300)
--lambda=XX - regularization parameter for matrices U and V (for ALS, default=1)
--aggregatevalidation=true - use validation data for training
--maxval=XX, --minval=XX - maximum and minimum allowed rating values; setting them is recommended to improve predictions.
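When running many experiments it can help to assemble the command line programmatically. The Python sketch below only strings together the positional arguments and the flags listed above; the helper name is hypothetical, it does not validate flag names, and it does not run anything.

```python
import shlex


def build_pmf_command(input_file, run_mode, iterations, **options):
    """Assemble the argument list for ./pmf from an input file, run mode and flags."""
    scheduler = f"round_robin(max_iterations={iterations},block_size=1)"
    args = ["./pmf", input_file, str(run_mode), f"--scheduler={scheduler}"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # boolean flags are written as true/false
        args.append(f"--{name}={value}")
    return args


if __name__ == "__main__":
    # "lambda" is a Python keyword, so it is passed through a dict.
    cmd = build_pmf_command("movielens_mm", 0, 10, matrixmarket=True,
                            ncpus=2, **{"lambda": 0.065})
    print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

Returning a list rather than a string keeps the scheduler argument, which contains parentheses and commas, safe to hand to `subprocess.run` without extra quoting.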
Running example 1 - Alternating matrix factorization
1) Download the movielens and movielense sample files.
The example dataset is MovieLens 1M ratings. You can download a file preprocessed into GraphLab format.
movielens training file movielens test file
2) Run the following command
./pmf movielens_mm 0 --scheduler="round_robin(max_iterations=10,block_size=1)" --matrixmarket=true --lambda=0.065 --ncpus=2
You should see an output like:
./pmf movielens_mm 0 --scheduler="round_robin(max_iterations=10,block_size=1)" --matrixmarket=true --lambda=0.065 --ncpus=2
INFO: pmf.cpp(do_main:434): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:441): Program compiled with it++ Support
Setting run mode ALS_MATRIX (Alternating least squares)
WARNING: pmf.h(verify_setup:410): It is recommended to set min and max allowed matrix values to improve prediction quality, using the flags --minval=XX, --maxval=XX
INFO: pmf.cpp(start:285): ALS_MATRIX (Alternating least squares) starting
loading data file movielens_mm
Loading Matrix Market file movielens_mm TRAINING
Loading movielens_mm TRAINING
Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
loading data file movielens_mme
Loading Matrix Market file movielens_mme VALIDATION
Loading movielens_mme VALIDATION
Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
loading data file movielens_mmt
Loading Matrix Market file movielens_mmt TEST
Loading movielens_mmt TEST
skipping file
setting regularization weight to 0.065
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.
ALS_MATRIX (Alternating least squares) for matrix (6040, 3952, 1):900000. D=20
pU=0.065, pV=0.065, pT=1, D=20
complete. Objective=6.17196e+06, TRAIN RMSE=3.7034 VALIDATION RMSE=3.7079.
INFO: pmf.cpp(run_graphlab:232): starting with scheduler: round_robin
max iterations = 10
step = 1
max_iterations = 10
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
Entering last iter with 1 total updates so far 9991
1.18921) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.37071e+06, TRAIN RMSE=2.2903 VALIDATION RMSE=0.9347.
Entering last iter with 2 total updates so far 19984
2.37481) Iter ALS_MATRIX (Alternating least squares) 2 Obj=572734, TRAIN RMSE=1.1217 VALIDATION RMSE=0.8934.
Entering last iter with 3 total updates so far 29974
3.55853) Iter ALS_MATRIX (Alternating least squares) 3 Obj=404727, TRAIN RMSE=0.9425 VALIDATION RMSE=0.8690.
Entering last iter with 4 total updates so far 39968
4.74525) Iter ALS_MATRIX (Alternating least squares) 4 Obj=338742, TRAIN RMSE=0.8624 VALIDATION RMSE=0.8605.
Entering last iter with 5 total updates so far 49960
5.92972) Iter ALS_MATRIX (Alternating least squares) 5 Obj=307645, TRAIN RMSE=0.8221 VALIDATION RMSE=0.8562.
Entering last iter with 6 total updates so far 59951
7.11545) Iter ALS_MATRIX (Alternating least squares) 6 Obj=290248, TRAIN RMSE=0.7988 VALIDATION RMSE=0.8536.
Entering last iter with 7 total updates so far 69943
8.30341) Iter ALS_MATRIX (Alternating least squares) 7 Obj=279441, TRAIN RMSE=0.7840 VALIDATION RMSE=0.8519.
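To track convergence without reading the log by eye, the per-iteration lines can be parsed with a regular expression. The Python sketch below is written against the line layout shown in the logs on this page (the leading number appears to be elapsed seconds); the function and field names are assumptions, and other GraphLab versions may print a different layout.

```python
import re

# Matches lines such as:
# "1.18921) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.37071e+06, TRAIN RMSE=2.2903 VALIDATION RMSE=0.9347."
ITER_LINE = re.compile(
    r"^\s*(?P<seconds>[\d.]+)\) Iter (?P<mode>.+?) (?P<iteration>\d+) "
    r"Obj=(?P<objective>[\d.eE+-]+), TRAIN RMSE=\s*(?P<train>[\d.]+) "
    r"(?:VALIDATION|TEST) RMSE=\s*(?P<heldout>[\d.]+|nan)\."
)


def parse_iterations(log_text):
    """Extract one dict per iteration line; lines that do not match are ignored."""
    rows = []
    for line in log_text.splitlines():
        m = ITER_LINE.match(line)
        if m:
            rows.append({
                "seconds": float(m.group("seconds")),
                "mode": m.group("mode"),
                "iteration": int(m.group("iteration")),
                "objective": float(m.group("objective")),
                "train_rmse": float(m.group("train")),
                "heldout_rmse": float(m.group("heldout")),  # may be nan
            })
    return rows


def best_iteration(rows):
    """Iteration with the lowest validation/test RMSE."""
    return min(rows, key=lambda r: r["heldout_rmse"])["iteration"]
```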
Running example 2 - MCMC matrix factorization
~/newgraphlab/graphlabapi/debug/apps/pmf$ ./PMF netflix 3 --ncpus=16 --D=30 --max_iter=30 --burn_in=20 --scheduler="round_robin(max_iterations=15,block_size=1)"
setting regularization 1.000000e+01
setting run mode 3
INFO :pmf.cpp(main:1096): BPTF starting
loading data file netflix
Loading netflix train
Creating 3298163 edges...
.................loading data file netflixe
Loading netflixe test
Creating 545177 edges...
...BPTF for tensor (95526, 3561, 27):3298163. D=30
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=30, beta=1, W=1, WT=1 BURN_IN=20
complete. Obj=2.30766e+07, TEST RMSE=3.7948.
sampled alpha is 0.999809
INFO :asynchronous_engine.hpp(run:56): Worker 0 started.
INFO :asynchronous_engine.hpp(run:56): Worker 1 started.
INFO :asynchronous_engine.hpp(run:56): Worker 2 started.
INFO :asynchronous_engine.hpp(run:56): Worker 3 started.
INFO :asynchronous_engine.hpp(run:56): Worker 4 started.
INFO :asynchronous_engine.hpp(run:56): Worker 5 started.
INFO :asynchronous_engine.hpp(run:56): Worker 6 started.
INFO :asynchronous_engine.hpp(run:56): Worker 7 started.
INFO :asynchronous_engine.hpp(run:56): Worker 8 started.
INFO :asynchronous_engine.hpp(run:56): Worker 9 started.
INFO :asynchronous_engine.hpp(run:56): Worker 10 started.
INFO :asynchronous_engine.hpp(run:56): Worker 11 started.
INFO :asynchronous_engine.hpp(run:56): Worker 12 started.
INFO :asynchronous_engine.hpp(run:56): Worker 14 started.
INFO :asynchronous_engine.hpp(run:56): Worker 15 started.
INFO :asynchronous_engine.hpp(run:56): Worker 13 started.
Entering last iter with 1
11.9759) Iter BPTF 1 Obj=2.31406e+07, TRAIN RMSE=3.7285 TEST RMSE=3.1044.
sampled alpha is 0.0719377
Entering last iter with 2
19.5264) Iter BPTF 2 Obj=5.4579e+06, TRAIN RMSE=1.7788 TEST RMSE=1.0372.
sampled alpha is 0.316193
Entering last iter with 3
27.0199) Iter BPTF 3 Obj=1.94329e+06, TRAIN RMSE=1.0122 TEST RMSE=1.0110.
sampled alpha is 0.97671
Entering last iter with 4
34.4465) Iter BPTF 4 Obj=1.85077e+06, TRAIN RMSE=0.9807 TEST RMSE=0.9986.
sampled alpha is 1.03911
Entering last iter with 5
41.9269) Iter BPTF 5 Obj=1.79675e+06, TRAIN RMSE=0.9583 TEST RMSE=0.9901.
sampled alpha is 1.08866
Entering last iter with 6
49.4382) Iter BPTF 6 Obj=1.77346e+06, TRAIN RMSE=0.9424 TEST RMSE=0.9800.
sampled alpha is 1.12418
Entering last iter with 7
56.8816) Iter BPTF 7 Obj=1.75737e+06, TRAIN RMSE=0.9283 TEST RMSE=0.9701.
sampled alpha is 1.15981
Entering last iter with 8
64.4001) Iter BPTF 8 Obj=1.75944e+06, TRAIN RMSE=0.9175 TEST RMSE=0.9622.
sampled alpha is 1.18641
Entering last iter with 9
71.9716) Iter BPTF 9 Obj=1.75971e+06, TRAIN RMSE=0.9095 TEST RMSE=0.9575.
sampled alpha is 1.20994
Entering last iter with 10
79.4099) Iter BPTF 10 Obj=1.7484e+06, TRAIN RMSE=0.8992 TEST RMSE=0.9541.
sampled alpha is 1.23627
Entering last iter with 11
86.894) Iter BPTF 11 Obj=1.72904e+06, TRAIN RMSE=0.8860 TEST RMSE=0.9512.
sampled alpha is 1.27441
Entering last iter with 12
94.3665) Iter BPTF 12 Obj=1.7215e+06, TRAIN RMSE=0.8775 TEST RMSE=0.9491.
sampled alpha is 1.29865
Entering last iter with 13
101.821) Iter BPTF 13 Obj=1.72508e+06, TRAIN RMSE=0.8725 TEST RMSE=0.9491.
sampled alpha is 1.31404
Entering last iter with 14
109.313) Iter BPTF 14 Obj=1.7315e+06, TRAIN RMSE=0.8691 TEST RMSE=0.9488.
sampled alpha is 1.32381
Entering last iter with 15
116.799) Iter BPTF 15 Obj=1.74022e+06, TRAIN RMSE=0.8661 TEST RMSE=0.9501.
sampled alpha is 1.33515
Entering last iter with 16
124.317) Iter BPTF 16 Obj=1.74469e+06, TRAIN RMSE=0.8626 TEST RMSE=0.9502.
sampled alpha is 1.34253
Entering last iter with 17
131.859) Iter BPTF 17 Obj=1.74901e+06, TRAIN RMSE=0.8593 TEST RMSE=0.9512.
sampled alpha is 1.3544
Entering last iter with 18
139.371) Iter BPTF 18 Obj=1.75623e+06, TRAIN RMSE=0.8563 TEST RMSE=0.9520.
sampled alpha is 1.3633
Entering last iter with 19
146.946) Iter BPTF 19 Obj=1.76375e+06, TRAIN RMSE=0.8535 TEST RMSE=0.9513.
Finished burn-in period. starting to aggregate samples
sampled alpha is 1.37124
Entering last iter with 20
154.395) Iter BPTF 20 Obj=1.77025e+06, TRAIN RMSE=0.8511 TEST RMSE=0.9514.
sampled alpha is 1.37925
Entering last iter with 21
161.914) Iter BPTF 21 Obj=1.77484e+06, TRAIN RMSE=0.8483 TEST RMSE=0.9515.
sampled alpha is 1.39034
Entering last iter with 22
169.413) Iter BPTF 22 Obj=1.748e+06, TRAIN RMSE=0.8347 TEST RMSE=0.9332.
sampled alpha is 1.43515
Entering last iter with 23
176.906) Iter BPTF 23 Obj=1.7427e+06, TRAIN RMSE=0.8281 TEST RMSE=0.9262.
sampled alpha is 1.45834
Entering last iter with 24
184.414) Iter BPTF 24 Obj=1.73938e+06, TRAIN RMSE=0.8233 TEST RMSE=0.9224.
sampled alpha is 1.47744
Entering last iter with 25
191.896) Iter BPTF 25 Obj=1.73633e+06, TRAIN RMSE=0.8191 TEST RMSE=0.9199.
sampled alpha is 1.49088
Entering last iter with 26
199.381) Iter BPTF 26 Obj=1.7327e+06, TRAIN RMSE=0.8152 TEST RMSE=0.9178.
sampled alpha is 1.50575
Entering last iter with 27
206.87) Iter BPTF 27 Obj=1.73199e+06, TRAIN RMSE=0.8116 TEST RMSE=0.9161.
sampled alpha is 1.51825
Entering last iter with 28
214.36) Iter BPTF 28 Obj=1.7314e+06, TRAIN RMSE=0.8082 TEST RMSE=0.9148.
sampled alpha is 1.53222
Entering last iter with 29
221.862) Iter BPTF 29 Obj=1.72802e+06, TRAIN RMSE=0.8050 TEST RMSE=0.9136.
sampled alpha is 1.5427
INFO :asynchronous_engine.hpp(run:66): Worker 7 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 2 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 8 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 5 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 15 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 14 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 4 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 1 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 6 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 3 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 11 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 12 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 10 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 9 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 13 finished.
INFO :asynchronous_engine.hpp(run:66): Worker 0 finished.
Final result. Obj=1.72124e+06, TEST RMSE= 0.9137.
Finished in 225.948695
Running example 3: Yahoo! KDD Cup 2011 - Track1
For an explanation of how to use GraphLab with the Yahoo! KDD Cup data, including conversion of the data using Matlab or Python, see
my blog.
Running example 4: BPTF (Bayesian monte carlo matrix factorization) using Twitter social graph
This example was donated by Timmy Wilson @ smarttypes.org. It contains a Twitter network of
68 followers, 11646 followees, 1 day and 15883 links. Download the input file here.
<29|0>bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ ./pmf smarttypes_pmf 1 --scheduler="round_robin(max_iterations=20,block_size=1)" --float=true
INFO: pmf.cpp(main:1260): PMF/ALS/SVD++/SGD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(main:1262): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
Setting run mode BPTF_MATRIX
INFO: pmf.cpp(main:1309): BPTF_MATRIX starting
loading data file smarttypes_pmf
Loading smarttypes_pmf TRAINING
Matrix size is: USERS 68 MOVIES 11646 TIME BINS 1
Creating 15883 edges (observed ratings)...
.loading data file smarttypes_pmfe
Loading smarttypes_pmfe VALIDATION
skipping file
loading data file smarttypes_pmft
Loading smarttypes_pmft TEST
skipping file
setting regularization weight to 1
BPTF_MATRIX for matrix (68, 11646, 1):15883. D=20
pU=1, pV=1, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
complete. Obj=7576.43, TRAIN RMSE=0.9513 VALIDATION RMSE=nan.
sampled alpha is 0.997129
max iterations = 20
step = 1
max_iterations = 20
INFO: asynchronous_engine.hpp(run:94): Worker 0 started.
INFO: asynchronous_engine.hpp(run:94): Worker 1 started.
Entering last iter with 1
0.361552) Iter BPTF_MATRIX 1 Obj=4646.35, TRAIN RMSE=0.7270 VALIDATION RMSE=nan.
sampled alpha is 1.90087
Entering last iter with 2
0.728271) Iter BPTF_MATRIX 2 Obj=1698.21, TRAIN RMSE=0.3103 VALIDATION RMSE=nan.
sampled alpha is 10.4702
Entering last iter with 3
1.12834) Iter BPTF_MATRIX 3 Obj=1368.41, TRAIN RMSE=0.1506 VALIDATION RMSE=nan.
sampled alpha is 44.3981
Entering last iter with 4
1.49151) Iter BPTF_MATRIX 4 Obj=1276.31, TRAIN RMSE=0.1245 VALIDATION RMSE=nan.
sampled alpha is 63.932
Entering last iter with 5
1.89511) Iter BPTF_MATRIX 5 Obj=1203.64, TRAIN RMSE=0.0904 VALIDATION RMSE=nan.
sampled alpha is 122.476
Entering last iter with 6
2.25427) Iter BPTF_MATRIX 6 Obj=1178.26, TRAIN RMSE=0.0744 VALIDATION RMSE=nan.
sampled alpha is 180.563
Entering last iter with 7
2.65659) Iter BPTF_MATRIX 7 Obj=1170.38, TRAIN RMSE=0.0575 VALIDATION RMSE=nan.
sampled alpha is 297.039
Entering last iter with 8
3.02014) Iter BPTF_MATRIX 8 Obj=1160.73, TRAIN RMSE=0.0477 VALIDATION RMSE=nan.
sampled alpha is 419.463
Entering last iter with 9
3.42518) Iter BPTF_MATRIX 9 Obj=1162.77, TRAIN RMSE=0.0394 VALIDATION RMSE=nan.
Finished burn-in period. starting to aggregate samples
sampled alpha is 610.536
Entering last iter with 10
3.79515) Iter BPTF_MATRIX 10 Obj=1161.87, TRAIN RMSE=0.0341 VALIDATION RMSE=nan.
sampled alpha is 810.82
Entering last iter with 11
4.19491) Iter BPTF_MATRIX 11 Obj=1469.61, TRAIN RMSE=0.1970 VALIDATION RMSE=nan.
sampled alpha is 25.4017
Entering last iter with 12
4.56205) Iter BPTF_MATRIX 12 Obj=1484.45, TRAIN RMSE=0.2007 VALIDATION RMSE=nan.
sampled alpha is 24.5661
Entering last iter with 13
4.96378) Iter BPTF_MATRIX 13 Obj=1230.12, TRAIN RMSE=0.0700 VALIDATION RMSE=nan.
sampled alpha is 203.111
Entering last iter with 14
5.33124) Iter BPTF_MATRIX 14 Obj=1229.07, TRAIN RMSE=0.0718 VALIDATION RMSE=nan.
sampled alpha is 193.54
Entering last iter with 15
5.72784) Iter BPTF_MATRIX 15 Obj=1209.51, TRAIN RMSE=0.0424 VALIDATION RMSE=nan.
sampled alpha is 536.412
Entering last iter with 16
6.101) Iter BPTF_MATRIX 16 Obj=1214.21, TRAIN RMSE=0.0419 VALIDATION RMSE=nan.
sampled alpha is 555.104
Entering last iter with 17
6.49673) Iter BPTF_MATRIX 17 Obj=1212.21, TRAIN RMSE=0.0310 VALIDATION RMSE=nan.
sampled alpha is 1000.04
Entering last iter with 18
6.87056) Iter BPTF_MATRIX 18 Obj=1215.99, TRAIN RMSE=0.0307 VALIDATION RMSE=nan.
sampled alpha is 987.797
Entering last iter with 19
7.2658) Iter BPTF_MATRIX 19 Obj=1217.74, TRAIN RMSE=0.0237 VALIDATION RMSE=nan.
sampled alpha is 1596.85
Entering last iter with 20
7.64149) Iter BPTF_MATRIX 20 Obj=1224.86, TRAIN RMSE=0.0233 VALIDATION RMSE=nan.
sampled alpha is 1677.19
INFO: asynchronous_engine.hpp(run:102): Worker 1 finished.
INFO: asynchronous_engine.hpp(run:102): Worker 0 finished.
Final result. Obj=1222.59, TRAIN RMSE= 0.0155 VALIDATION RMSE= nan.
Finished in 7.686977
Performance counters are: 0) EDGE_TRAVERSAL, 0.735296
Performance counters are: 1) BPTF_SAMPLE_STEP, 0.803732
Performance counters are: 2) CALC_RMSE_Q, 0.005395
Performance counters are: 6) CALC_OBJ, 0.028909
Performance counters are: 7) BPTF_MVN_RNDEX, 4.1201
Performance counters are: 8) BPTF_LEAST_SQUARES2, 1.15168
=== REPORT FOR core() ===
[Numeric]
ncpus: 2
[Other]
affinities: false
compile_flags:
engine: async
scheduler: round_robin
schedyield: true
scope: edge
=== REPORT FOR engine() ===
[Numeric]
num_edges: 15883
num_syncs: 0
num_vertices: 11714
updatecount: 234280
[Timings]
runtime: 7.6 s
[Other]
termination_reason: task depletion (natural)
[Numeric]
updatecount_vector: 234280 (count: 2, min: 117120, max: 117160, avg: 117140)
updatecount_vector.values: 117120,117160,
<30|0>bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf$
Running example 5 - implicit rating via weighted-ALS
This example shows how GraphLab collaborative filtering can handle implicit rating datasets.
To understand the construction, it is recommended to read the paper: One-Class Collaborative Filtering
by Rong Pan, Yunhong Zhou, Bin Cao, N. N. Liu, R. Lukose, M. Scholz, Qiang Yang. In IEEE International Conference on Data Mining (ICDM '08), 2008.
./pmf netflix 9 --scheduler="round_robin(max_iterations=10,block_size=1)" --zero=true --implicitratingtype=uniform --implicitratingpercentage=0.03 --implicitratingvalue=0 --implicitratingweight=0.5
Starting program: /mnt/bigbrofs/usr6/bickson/newgraphlab/graphlabapi/debug/demoapps/pmf/pmf netflix 9 --scheduler="round_robin(max_iterations=10,block_size=1)" --zero=true --implicitratingtype=uniform --implicitratingpercentage=0.03 --implicitratingvalue=0 --implicitratingweight=0.5
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffa63fd000
[Thread debugging using libthread_db enabled]
[New Thread 47893315401712 (LWP 8946)]
INFO: pmf.cpp(do_main:417): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:424): Program compiled with it++ Support
Setting run mode Weighted alternating least squares
INFO: pmf.cpp(start:269): Weighted alternating least squares starting
loading data file netflix
Loading netflix TRAINING
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
Creating 3298163 edges (observed ratings)...
.................INFO: implicit.hpp(add_implicit_edges:77): added 9881029 implicit edges, rating=0 weight=0.5 type=uniform
loading data file netflixe
Loading netflixe VALIDATION
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
Creating 545177 edges (observed ratings)...
...loading data file netflixt
Loading netflixt TEST
skipping file
setting regularization weight to 1
Weighted alternating least squares for matrix (95526, 3561, 27):13179192. D=20
pU=1, pV=1, pT=1, D=20
complete. Objective=1.99427e+08, TRAIN RMSE=5.5012 VALIDATION RMSE=11.6063.
[New Thread 1199630672 (LWP 8976)]
INFO: pmf.cpp(run_graphlab:219): starting with scheduler: round_robin
max iterations = 10
step = 1
max_iterations = 10
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
[New Thread 1216416080 (LWP 8978)]
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
Entering last iter with 1
28.4887) Iter Weighted alternating least squares 1 Obj=1.84392e+08, TRAIN RMSE=5.2859 VALIDATION RMSE=8.2575.
Entering last iter with 2
56.6783) Iter Weighted alternating least squares 2 Obj=5.09993e+07, TRAIN RMSE=2.7740 VALIDATION RMSE=5.3221.
Entering last iter with 3
84.1871) Iter Weighted alternating least squares 3 Obj=3.59321e+07, TRAIN RMSE=2.3284 VALIDATION RMSE=4.9189.
Entering last iter with 4
113.502) Iter Weighted alternating least squares 4 Obj=3.10098e+07, TRAIN RMSE=2.1633 VALIDATION RMSE=4.7755.
The relevant command line flags related to implicit ratings are:
--implicitratingtype=user or --implicitratingtype=uniform
Adds implicit edges in proportion to each user's current number of edges, or uniformly to every user.
--implicitratingpercentage - a number between 0 and 1 that determines the percentage of edges to add to the sparse model.
0 means none, while 1 means a fully dense model.
--implicitratingvalue - the value of the added rating. By default it is zero, but you can change it.
--implicitratingweight - the weight of the implicit rating (or time). By default it is one.
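To make the effect of these flags concrete, here is a toy Python sketch of the uniform variant. It is an interpretation of the flag descriptions, not the library's code: it assumes each unobserved user/item cell is added independently with probability equal to the percentage, which gives nothing at 0 and a fully dense matrix at 1. The function name and the row layout (user, item, weight, rating) are assumptions, and the double loop is only practical for small matrices.

```python
import random


def add_uniform_implicit_ratings(observed, num_users, num_items,
                                 percentage, value=0.0, weight=1.0, seed=0):
    """Return observed rows plus randomly sampled implicit rows.

    Rows are (user, item, weight, rating) with 1-based ids.
    """
    rng = random.Random(seed)
    seen = {(u, i) for u, i, _, _ in observed}
    augmented = list(observed)
    for u in range(1, num_users + 1):
        for i in range(1, num_items + 1):
            # Only unobserved cells are candidates for an implicit rating.
            if (u, i) not in seen and rng.random() < percentage:
                augmented.append((u, i, weight, value))
    return augmented


if __name__ == "__main__":
    observed = [(1, 1, 1.0, 5.0), (2, 3, 1.0, 4.0)]
    dense = add_uniform_implicit_ratings(observed, 3, 4, percentage=1.0, weight=0.5)
    print(len(dense))  # every cell of the 3x4 matrix is now present
```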
Running example 6: Netflix data with sparse movie factor matrix
In this example we show how to factorize Netflix data, with the requirement that
90% of the movie factor matrix will be zeros. Next, you can use the sparse matrices
for clustering similar users or movies together into related groups.
bickson@biggerbro:~/newgraphlab/graphlabapi/debug/demoapps/pmf $ ./pmf netflix 12
--scheduler="round_robin(max_iterations=10,block_size=1)" --float=false
--ncpus=8 --desired_factor_sparsity=0.9 --lambda=0.06
INFO: pmf.cpp(main:565): PMF/BPTF/ALS/SVD++/SGD/SVD Code written
By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(main:567): Code compiled with GL_NO_MULT_EDGES flag
- this mode does not support multiple edges between user and movie in
different times
WARNING: pmf.cpp(main:570): Code compiled with GL_NO_MCMC flag - this
mode does not support MCMC methods.
Setting run mode Alternating least squares with sparse movie factor
matrix
INFO: pmf.cpp(start:370): Alternating least squares with sparse
movie factor matrix starting
loading data file netflix
Loading netflix TRAINING
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
Creating 3298163 edges (observed ratings)...
.................loading data file netflixe
Loading netflixe VALIDATION
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 27
Creating 545177 edges (observed ratings)...
...loading data file netflixt
Loading netflixt TEST
skipping file
setting regularization weight to 0.06
Alternating least squares with sparse movie factor matrix for matrix
(95526, 3561, 27):3298163. D=20
pU=0.06, pV=0.06, pT=1, D=20
Current sparsity : 0 %
complete. Objective=1.30614e+07, TRAIN RMSE=2.8139 VALIDATION
RMSE=2.8790.
max iterations = 10
step = 1
max_iterations = 10
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
INFO: asynchronous_engine.hpp(run:111): Worker 2 started.
INFO: asynchronous_engine.hpp(run:111): Worker 3 started.
INFO: asynchronous_engine.hpp(run:111): Worker 4 started.
INFO: asynchronous_engine.hpp(run:111): Worker 6 started.
INFO: asynchronous_engine.hpp(run:111): Worker 7 started.
INFO: asynchronous_engine.hpp(run:111): Worker 5 started.
Entering last iter with 1
Current sparsity : 0.9
2.61367) Iter Alternating least squares with sparse movie factor
matrix 1 Obj=8.39139e+06, TRAIN RMSE=2.2338 VALIDATION RMSE=2.4251.
Entering last iter with 2
Current sparsity : 0.95 %
5.25192) Iter Alternating least squares with sparse movie factor
matrix 2 Obj=2.52153e+06, TRAIN RMSE=1.2152 VALIDATION RMSE=1.6419.
Entering last iter with 3
Current sparsity : 0.9
7.88379) Iter Alternating least squares with sparse movie factor
matrix 3 Obj=2.36985e+06, TRAIN RMSE=1.1787 VALIDATION RMSE=1.3749.
Entering last iter with 4
Current sparsity : 0.9
10.5112) Iter Alternating least squares with sparse movie factor
matrix 4 Obj=2.57171e+06, TRAIN RMSE=1.2280 VALIDATION RMSE=1.3589.
Entering last iter with 5
Current sparsity : 0.9
13.0986) Iter Alternating least squares with sparse movie factor
matrix 5 Obj=2.76916e+06, TRAIN RMSE=1.2758 VALIDATION RMSE=1.3188.
Entering last iter with 6
Current sparsity : 0.9 %
15.7324) Iter Alternating least squares with sparse movie factor
matrix 6 Obj=2.74914e+06, TRAIN RMSE=1.2721 VALIDATION RMSE=1.2410.
Entering last iter with 7
Current sparsity : 0.9
18.3847) Iter Alternating least squares with sparse movie factor
matrix 7 Obj=2.53998e+06, TRAIN RMSE=1.2239 VALIDATION RMSE=1.0778.
Entering last iter with 8
Current sparsity : 0.9
20.9803) Iter Alternating least squares with sparse movie factor
matrix 8 Obj=1.84584e+06, TRAIN RMSE=1.0436 VALIDATION RMSE=0.9723.
Entering last iter with 9
Current sparsity : 0.9
23.6121) Iter Alternating least squares with sparse movie factor
matrix 9 Obj=1.48064e+06, TRAIN RMSE=0.9341 VALIDATION RMSE=0.9608.
Entering last iter with 10
Current sparsity : 0.9
26.1979) Iter Alternating least squares with sparse movie factor
matrix 10 Obj=1.43894e+06, TRAIN RMSE=0.9217 VALIDATION RMSE=0.9596.
INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.
Current sparsity : 0.9
Final result. Obj=1.43894e+06, TRAIN RMSE= 0.9217 VALIDATION RMSE=
0.9596.
Finished in 26.611790 seconds
Performance counters are: 0) EDGE_TRAVERSAL, 49.7254
Performance counters are: 2) CALC_RMSE_Q, 0.001046
Performance counters are: 3) ALS_LEAST_SQUARES, 81.24
Performance counters are: 6) CALC_OBJ, 0.59852
=== REPORT FOR core() ===
[Numeric]
ncpus: 8
[Other]
affinities: false
compile_flags:
engine: async
scheduler: round_robin
schedyield: true
scope: edge
=== REPORT FOR engine() ===
[Numeric]
num_edges: 3.29816e+06
num_syncs: 0
num_vertices: 99087
updatecount: 990870
[Timings]
runtime: 26.2 s
[Other]
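The "Current sparsity" value in the log above is the fraction of zero entries in the movie factor matrix. This section does not describe how the library enforces the requested sparsity, so the Python sketch below uses plain magnitude-based hard thresholding as a stand-in: it keeps the largest entries and zeroes the rest. The function names sparsity and hard_threshold are hypothetical and are not part of the library.

```python
def sparsity(matrix):
    """Fraction of entries in a list-of-lists matrix that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


def hard_threshold(matrix, desired_sparsity):
    """Zero the smallest-magnitude entries until desired_sparsity is reached.

    Assumption: simple hard thresholding, used here only to illustrate what
    a 0.9 sparsity target means. The library may use a different method.
    """
    cells = [(abs(x), r, c)
             for r, row in enumerate(matrix)
             for c, x in enumerate(row)]
    n_zero = int(round(desired_sparsity * len(cells)))
    cells.sort()  # smallest magnitudes first
    out = [list(row) for row in matrix]
    for _, r, c in cells[:n_zero]:
        out[r][c] = 0.0
    return out


# Toy 2 x 5 "movie factor matrix".
V = [[0.5, -2.0, 0.1, 3.0, -0.2],
     [1.5, -0.05, 0.3, -4.0, 0.7]]
V_sparse = hard_threshold(V, 0.8)
print(sparsity(V), sparsity(V_sparse))
```

Rows of the sparse matrix that share the same nonzero positions are natural candidates for the same cluster, which is one way to use the sparse factors for grouping.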
Running example 7: loading from matrix market sparse matrix format
1) Download the input files smallnetflix_mm and smallnetflix_mme.
These are text input files with the following format:
%%MatrixMarket matrix coordinate real general
% Generated 28-Aug-2011
95526 3561 3298163
13 1 1
83 1 2
127 1 2
136 1 5
137 1 4
1
Here 95526 is the number of users, 3561 is the number of movies, and 3298163 is the number of ratings.
Each row holds one rating: in the first row, user 13 rated movie 1 and gave it a rating of 1.
2) Run alternating least squares. Don't forget the switch --matrixmarket=true .
<33|1>bickson@biggerbro:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf smallnetflix_mm 0 --matrixmarket=true --scheduler="round_robin(max_iterations=10,block_size=1)"
INFO: pmf.cpp(do_main:465): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:467): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
Setting run mode ALS_MATRIX (Alternating least squares)
INFO: pmf.cpp(start:308): ALS_MATRIX (Alternating least squares) starting
loading data file smallnetflix_mm
Loading Matrix Market file smallnetflix_mm TRAINING
Loading smallnetflix_mm TRAINING
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
loading data file smallnetflix_mme
Loading Matrix Market file smallnetflix_mme VALIDATION
Loading smallnetflix_mme VALIDATION
Matrix size is: USERS 95526 MOVIES 3561 TIME BINS 1
loading data file smallnetflix_mmt
Loading Matrix Market file smallnetflix_mmt TEST
Loading smallnetflix_mmt TEST
skipping file
setting regularization weight to 1
ALS_MATRIX (Alternating least squares) for matrix (95526, 3561, 1):3298163. D=20
pU=1, pV=1, pT=1, D=20
complete. Objective=2.26985e+07, TRAIN RMSE=3.7098 VALIDATION RMSE=3.7762.
max iterations = 10
step = 1
max_iterations = 10
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
Entering last iter with 1
4.52759) Iter ALS_MATRIX (Alternating least squares) 1 Obj=2.13107e+07, TRAIN RMSE=3.5919 VALIDATION RMSE=2.5129.
Entering last iter with 2
9.05031) Iter ALS_MATRIX (Alternating least squares) 2 Obj=2.76594e+06, TRAIN RMSE=1.2658 VALIDATION RMSE=1.4300.
....
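To check an input file before passing it to the program, the Matrix Market coordinate format shown in step 1 can be read with a few lines of Python. This is a minimal sketch that handles only the "coordinate real general" layout used in this example. The function name parse_matrix_market is hypothetical, and the sample text is a truncated excerpt, so its size line declares many more ratings than the five rows included.

```python
SAMPLE = """%%MatrixMarket matrix coordinate real general
% Generated 28-Aug-2011
95526 3561 3298163
13 1 1
83 1 2
127 1 2
136 1 5
137 1 4
"""


def parse_matrix_market(text):
    """Parse coordinate-format Matrix Market text.

    Returns (n_users, n_movies, declared_ratings, ratings), where ratings is
    a list of (user, movie, rating) tuples. Indices are 1-based in the file
    and are kept as-is.
    """
    lines = iter(text.splitlines())
    header = next(lines)
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a coordinate-format Matrix Market file")
    # Skip comment lines (they start with '%') and find the size line.
    for line in lines:
        if line.strip() and not line.startswith("%"):
            break
    else:
        raise ValueError("missing size line")
    n_users, n_movies, declared = (int(tok) for tok in line.split())
    ratings = []
    for line in lines:
        if not line.strip():
            continue
        user, movie, rating = line.split()
        ratings.append((int(user), int(movie), float(rating)))
    return n_users, n_movies, declared, ratings


n_users, n_movies, declared, ratings = parse_matrix_market(SAMPLE)
print(n_users, n_movies, declared, len(ratings))
```

For a complete file, the number of parsed rows should equal the declared number of ratings, and every user and movie index should lie between 1 and the corresponding dimension.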
Running example 8: SVD++ with MovieLens data
You can download the MovieLens data (the training file is movielens_mm and the validation file is movielens_mme) from here.
./pmf movielens_mm 5 --scheduler="round_robin(max_iterations=10,block_size=1)" --float=true --ncpus=8 --maxval=5 --minval=1 --matrixmarket=true
INFO: pmf.cpp(do_main:434): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:441): Program compiled with it++ Support
Setting run mode SVD++
INFO: pmf.cpp(start:285): SVD++ starting
loading data file movielens_mm
Loading Matrix Market file movielens_mm TRAINING
Loading movielens_mm TRAINING
Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
loading data file movielens_mme
Loading Matrix Market file movielens_mme VALIDATION
Loading movielens_mme VALIDATION
Matrix size is: USERS 6040 MOVIES 3952 TIME BINS 1
loading data file movielens_mmt
Loading Matrix Market file movielens_mmt TEST
Loading movielens_mmt TEST
skipping file
...
SVD++ for matrix (6040, 3952, 1):900000. D=20
SVD++ 20 factors
complete. Objective=2.88718e-305, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:232): starting with scheduler: round_robin
max iterations = 10
step = 1
max_iterations = 10
...
Entering last iter with 1
0.662835) Iter SVD 1, TRAIN RMSE=1.7591 VALIDATION RMSE=1.6971.
Entering last iter with 2
1.08171) Iter SVD 2, TRAIN RMSE=1.6513 VALIDATION RMSE=1.5921.
Entering last iter with 3
1.46447) Iter SVD 3, TRAIN RMSE=1.5506 VALIDATION RMSE=1.5136.
Entering last iter with 4
1.88655) Iter SVD 4, TRAIN RMSE=1.4606 VALIDATION RMSE=1.4388.
Entering last iter with 5
2.34319) Iter SVD 5, TRAIN RMSE=1.3883 VALIDATION RMSE=1.3739.
Entering last iter with 6
2.75348) Iter SVD 6, TRAIN RMSE=1.3335 VALIDATION RMSE=1.3205.
Entering last iter with 7
3.65773) Iter SVD 7, TRAIN RMSE=1.2506 VALIDATION RMSE=1.3780.
Other examples
Further examples are found in the datasets and benchmark page.
Debugging execution
To debug your dataset features, run with the --stats=true command line option.
For example: ./pmf netflix 0 --stats=true.
You can also use the --debug=true flag to print debug traces.
Acknowledgements
As the project grows, so does the list of people we should thank.