GraphLab cluster deployment – quick start
Note: the MPI section of this tutorial is based on this excellent tutorial.
Preliminaries:
- MPI should be installed
Step 0: Install GraphLab on one of your cluster nodes.
Follow the instructions here on your master node (one of your cluster machines).
Step 1: start MPI
a) Create a file called .mpd.conf in your home directory (only once)
This file should contain a secret password using the format:
secretword=some_password_as_you_wish
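As a minimal sketch of step a), the file can be created from the shell as shown below. mpd also expects the file to be readable only by its owner, hence the chmod. The conf_file variable is just a name used in this example, and the password is the placeholder from above.

```shell
# conf_file is a name used only in this sketch; mpd reads ~/.mpd.conf.
conf_file="$HOME/.mpd.conf"

# Create the file only once, so an existing secret is not overwritten.
if [ ! -f "$conf_file" ]; then
  echo "secretword=some_password_as_you_wish" > "$conf_file"
fi

# mpd refuses to start if other users can read the file.
chmod 600 "$conf_file"
```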
b) Verify that MPI is working by running the daemon (only once)
$ mpd &        # run the MPI daemon
$ mpdtrace     # lists all active MPI nodes
$ mpdallexit   # kill MPI
c) Spawn the cluster. Create a file named machines that lists the machines you would like to deploy on, one machine per line.
Run (for MPICH2):
$ mpd &
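For illustration, a machines file could be written and checked as in the sketch below. The host names node1 and node2 are hypothetical, and the ~/machines location is an assumption taken from the mpiexec command used later in this tutorial.

```shell
# node1 and node2 are hypothetical host names; replace them with your own.
# Note: this overwrites any existing ~/machines file.
machines_file="$HOME/machines"
printf '%s\n' node1 node2 > "$machines_file"

# Count the non-empty lines to see how many hosts are listed.
host_count=$(grep -c . "$machines_file")
echo "machines file lists $host_count host(s)"
```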
d) Copy GraphLab files to all machines.
On the node you installed GraphLab on, run the following commands to copy GraphLab files to the rest of the machines:
d1) Verify that the machines file from step 1c) is in your home folder (~/machines).
d2) Copy the GraphLab files
cd ~/graphlabapi/release/toolkits; ~/graphlabapi/scripts/mpirsync
cd ~/graphlabapi/deps/local; ~/graphlabapi/scripts/mpirsync
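To see roughly what such a copy involves, the sketch below prints one rsync command per host listed in the machines file, without running anything. It illustrates the general approach only and is not the contents of the mpirsync script; print_sync_commands is a hypothetical helper.

```shell
# print_sync_commands MACHINES_FILE DIR
# Prints (does not run) an rsync command that mirrors DIR
# to the same path on each host in MACHINES_FILE.
print_sync_commands() {
  while read -r host; do
    # Skip blank lines in the machines file.
    [ -n "$host" ] || continue
    echo "rsync -az $2/ $host:$2/"
  done < "$1"
}
```

For example, print_sync_commands ~/machines ~/graphlabapi/release/toolkits lists the commands for the first folder above, so they can be reviewed before anything is copied.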
Step 2: Run GraphLab ALS
This step first downloads the data from the web (http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate) and then runs alternating least squares (ALS) for the number of iterations given by --max_iter (3 in the command below). After the run is completed, you can log in to any of the nodes and view the output files in the folder ~/graphlabapi/release/toolkits/collaborative_filtering/
cd /some/ns/folder/
mkdir smallnetflix
cd smallnetflix/
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate
Now run GraphLab:
mpiexec -n 2 -f ~/machines /path/to/als --matrix /some/ns/folder/smallnetflix/ --max_iter=3 --ncpus=1
Here -n is the number of MPI nodes, and --ncpus is the number of cores used on each MPI node.
machines is a file that lists the machines you would like to deploy on (one machine per line).
Note: this section assumes you have a network storage (ns) folder where the input can be stored.
Alternatively, you can split the input into several disjoint files, and store the subsets on the cluster machines.
Note: Don’t forget to change /path/to/als and /some/ns/folder to your actual folder path!
Note: For Open MPI, use --hostfile instead of -f.
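The difference between the two MPI flavors can be sketched as a small helper that prints the command line for each case. build_als_command is a hypothetical helper introduced for this example, and the paths are the same placeholders as above.

```shell
# build_als_command FLAVOR NODES
# FLAVOR is "mpich2" or "openmpi"; NODES is the value passed to -n.
# The command is only printed, so it can be checked before running it.
build_als_command() {
  case "$1" in
    openmpi) host_flag="--hostfile" ;;
    *)       host_flag="-f" ;;
  esac
  echo "mpiexec -n $2 $host_flag ~/machines /path/to/als --matrix /some/ns/folder/smallnetflix/ --max_iter=3 --ncpus=1"
}

build_als_command mpich2 2
build_als_command openmpi 2
```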
Step 3: Fine-tuning the GraphLab deployment
Errors and their resolution:
Error:
/mnt/info/home/daroczyb/als: error while loading shared libraries: libevent_pthreads-2.0.so.5: cannot open shared object file: No such file or directory
Solution:
Set LD_LIBRARY_PATH to the directory that contains libevent_pthreads. With Open MPI's mpiexec this is done with the -x option, which exports an environment variable to the launched processes, for example:
mpiexec --hostfile machines -x LD_LIBRARY_PATH=/home/daroczyb/graphlabapi/deps/local/lib/ /mnt/info/home/daroczyb/als /mnt/info/home/daroczyb/smallnetflix_mm.train
Error:
/mnt/info/home/daroczyb/als: error while loading shared libraries: libjvm.so: cannot open shared object file: No such file or directory
Solution:
Add the directory that contains libjvm.so to LD_LIBRARY_PATH, again using the -x option of mpiexec:
mpiexec --hostfile machines -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/daroczyb/graphlabapi/deps/local/lib/:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/ /mnt/info/home/daroczyb/als /mnt/info/home/daroczyb/smallnetflix_mm.train
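When several library directories are involved, it can help to assemble the path in one place and catch typos early. The sketch below joins the directories that exist on the local machine into a colon-separated list. build_lib_path is a hypothetical helper, and it only checks the machine it runs on, not the remote nodes.

```shell
# build_lib_path DIR...
# Prints the existing directories joined with ':' and warns about missing ones.
build_lib_path() {
  result=""
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # Add a ':' separator only when result already holds a directory.
      result="${result:+$result:}$dir"
    else
      echo "warning: skipping missing directory $dir" >&2
    fi
  done
  printf '%s\n' "$result"
}
```

The output can then be passed to mpiexec, for instance as -x LD_LIBRARY_PATH="$(build_lib_path /home/daroczyb/graphlabapi/deps/local/lib /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server)".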
Error:
problem with execution of /graphlabapi/release/toolkits/collaborative_filtering/als on debian1: [Errno 2] No such file or directory
Solution:
You should verify the executable is found on the same path on all machines.
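One way to check this is to test for the executable on every host in the machines file. The sketch below assumes the hosts are reachable with ssh; check_on_hosts and the remote_shell variable are hypothetical names introduced for this example.

```shell
# remote_shell is the command used to reach each host; ssh is an assumption.
remote_shell="ssh"

# check_on_hosts MACHINES_FILE PATH
# Prints the hosts on which PATH is missing or not executable.
check_on_hosts() {
  while read -r host; do
    # Skip blank lines in the machines file.
    [ -n "$host" ] || continue
    # stdin is redirected so the remote command does not swallow the host list.
    if ! "$remote_shell" "$host" test -x "$2" < /dev/null; then
      echo "missing on $host: $2"
    fi
  done < "$1"
}
```

Running check_on_hosts ~/machines /path/to/als prints nothing when the executable is present everywhere.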
Error: a prompt asking for password when running mpiexec
Solution: Use the following instructions to allow connection with a public/private key pair (no password).
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://[domain]:9000/user/[user_name]/data.txt, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:487)
Call to org.apache.hadoop.fs.FileSystem::listStatus failed!
WARNING: distributed_graph.hpp(load_from_hdfs:1889): No files found matching hdfs://[domain]:9000/user/[user_name]/data.txt
Solution:
Verify that your classpath includes all the folders Hadoop requires.
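A common way to obtain those folders is the hadoop classpath command, which prints the classpath of the local Hadoop installation. The sketch below exports it as CLASSPATH when hadoop is on the PATH; set_hadoop_classpath is a hypothetical helper, and whether this resolves the error on your cluster is an assumption to verify.

```shell
# set_hadoop_classpath
# Exports CLASSPATH from `hadoop classpath` if the hadoop command is available.
set_hadoop_classpath() {
  if command -v hadoop > /dev/null 2>&1; then
    CLASSPATH="$(hadoop classpath)"
    export CLASSPATH
  else
    echo "hadoop not found on PATH; set CLASSPATH by hand" >&2
    return 1
  fi
}
```

With Open MPI, the variable can then be forwarded to the launched processes in the same way as LD_LIBRARY_PATH above, by adding -x CLASSPATH to the mpiexec command.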

