GraphLab

A New Parallel Framework for Machine Learning

Matlab To GraphLab

Matlab to C++

  1. Introduction
  2. System Requirement
  3. Embedded Matlab
  4. Matlab to GraphLab
    1. Graph Representation
    2. Update Functions
    3. Compilation
  5. Limitations
  6. Demo

Introduction

The goal is to simplify the process of GraphLab rapid propotyping by allow the user to write update functions in Matlab while still providing good parallel performance. The problem however, is that Matlab is not thread-safe, and providing a GraphLab->Matlab interface by simply using Mex will not offer any performance gains through the use of parallelism.

To solve this problem, we provide the Matlab to GraphLab compiler. The Matlab to GraphLab compiler is a toolkit which uses Matlab 2010b's EMLC compiler to translate update functions written in Matlab to native C code. The C code is then compiled into a GraphLab program. Additional interface functions are also provided which allow the user to interact with the GraphLab program from Matlab. The only caveat is that update functions must be written in Embedded Matlab: a subset of the full Matlab language.

Unfortunately not all of GraphLab is fully supported yet. We will continue making improvements available as we implement them.

System Requirements

Embedded Matlab

Embedded Matlab is a subset of the full Matlab language. In particular, it does not support


(Source:
http://www.mathworks.com/help/toolbox/eml/ug/bq37agh.html#bq37dee)

Only a (relatively large) subset of Matlab functions are supported.

In addition, the Matlab to GraphLab Compiler does not yet support

Typing

Embedded Matlab can be thought of as a "strongly typed" version of Matlab. The type of the variables cannot depend on the program flow.

For instance: the following code is not allowed since the type of x depends on the value of c:

% Source: http://www.mathworks.com/help/toolbox/eml/ug/bq2l74g.html if(c>0) x = int8(0); else x = [1 2 3]; end disp(x);

However, the following is allowed:

x = int8(0); x = [1 2 3];

In this case: the EMLC compiler will rename the second instance of x. See here for more details on the use of variables in Embedded Matlab.

Matlab To GraphLab

Graph Representation

The data graph is defined over two data types. A vertex data type, and an edge data type. In MatLab, this means that all vertices must be of the same data type, and all edges must be of a same (possibly different) data type.

For instance, if one vertex data is a double matrix, all vertex data must be double matrices. And if one edge data has a particular struct layout, All edge data must follow the identical struct layout.

For instance: if I were to define a pairwise Markov Random Field for Belief Propagation, I could define my vertex and edge data to be:

vdata.belief : 1 x N matrix vdata.unary : 1 x N matrix edata.msg : 1 x N matrix edata.binary: N x N matrix

For users familar with EMLC, you might know that EMLC defines arrays of different sizes to be different types. To simplify the use of the Matlab to GraphLab, the compiler automatically forces all array types to be variable sizes. Therefore the effective data type is:

vdata.belief : 1 x :? matrix vdata.unary : 1 x :? matrix edata.msg : 1 x :? matrix edata.binary: :? x :? matrix

Where each ":?" can be a different value for each vertex/edge, and can change at runtime.

The canonical representation we will use for the graph representation is the triplet (VDATA, ADJ, EDATA).

VDATA
is a cell array of vertex data where VDATA{i} is the data on vertex i. The length of this cell array is the number of vertices in the graph.
ADJ
A (sparse) adjacency matrix. If ADJ(i,j) > 0, there is an edge from vertex i to vertex j, and the data on the edge i->j is EDATA{ADJ(i,j)} The maximum dimensions of ADJ is #vertices * #vertices.
EDATA
A cell array of edge data. The ADJ adjacency matrix references into this array. It is possible for many edges to share the same data element. For instance, if I have many edges with the same value, it is possible for all their ADJ entries to point to a single entry in EDATA.

Update Functions

An update function must satisfy the following interface:

function bp_update(currentvertex, inedges, inv, outedges, outv, handle) %#eml

currentvertex
an integer ID of the vertex currently being evaluated
inedges
An array of integers identifying the set of edges entering the current vertex.
inv
An array of integers of the same length as inedges. inv[i] is the vertex ID of the source of the edge inedges[i].
outedges
An array of integers identifying the set of edges leaving the current vertex.
outv
An array of integers of the same length as outedges. outv[i] is the vertex ID of the source of the edge outedges[i].
handle
A double value. Do not modify. This should be passed directly to the "link" functions

The %#eml is to denote that this function is an Embedded Matlab function. It is optional. Omitting it will result in a compiler warning.

A useful property to note is that both inv and outv are sorted arrays. Therefore, if the entire graph is bi-directed, (i.e. inv == outv), then necessarily inedges[i] is the corresponding reverse edge for outedges[i].

In addition: the following functions are provided which allow the update functions access to the graph data and the scheduler. We call these functions "link" functions. (they allow the EML code to call GraphLab code) The handle parameter must be passed unmodified to these functions:

The following four functions provide access to vertex data and edge data.

function vdata = get_vertex_data(handle, vertex) % returns data on vertex function edata = get_edge_data(handle, edge) % returns data on edge function set_vertex_data(handle, vertex, vdata) % sets data on vertex function set_edge_data(handle, edge, edata) % sets data on edge

In addition, we provde an 'add_task' function which allows for dynamic scheduling.

function add_task(handle, vertex, 'function name', priority)

Compilation

Whenever the vertex data type / edge data type / or the update function changes, it is necessary to recompile the code.

The function compile_update_function in the graphlab/matlab directory is used to compile the code. (Do not move the compile_update_function.m file from that location! It uses its location to identify the directory for the GraphLab header files). help compile_update_function provides details on arguments and the compilation process.

compile_update_function is an m file which compiles a collection of update functions into a graphlab program.
   compile_update_function(UPDATE_FNS, EX_VERTEX, EX_EDGE,
                           GLLIB_DIR, WORKING_DIR, OUT_NAME,
                           OPT_LEVEL = 0, PAR_COMPILES = 4,
                           NO_TCMALLOC = 0);
This function is the wrapper around the Matlab -> GraphLab compilation process. It will use emlc to compile a collection of update function names as listed as strings in UPDATE_FNS into binary executables in the WORKING_DIR.
UPDATE_FNS
is a cell array of strings, where each string is the name of a GraphLab update function to compile. (see above for details on the update functions)
EX_VERTEX
This should be representative example of the data on a vertex.
EX_EDGE
This should be representative example of the data on a edge.
GLLIB_DIR
This should point to a directory where the GraphLab binary libraries can be found. For instance, this could be empty if all graphlab libraries are already in /usr/lib. Or if you compiled GraphLab without installing it, this could point to the release/src/graphlab directory.
WORKING_DIR
All intermediate source files and Makefiles will be stored in this directory. The resultant binaries will also be stored here. This directory will be created if it does not already exist
OUT_NAME
This is the base name to use for the final compiled binaries/m files.
OPT_LEVEL
Optimization level. corresponds to the -O[N] flag in 'gcc'. Defaults to 0.
PAR_COMPILES
This function will automatically begin building the Makefiles. This is the parallelization level for the build and corresponds to -j[N] flag in 'make'. Defaults to 4.
NO_TCMALLOC
This function will try to detect the existance of libtcmalloc through the use of 'whereis'. If detected, it will automatically try to link the binaries against libtcmalloc. If this flag is set to non-zero, we will never link against tcmalloc even if detected. Defaults to false.

Output

The script will poroduce a Makefile in the WORKING_DIR which will compile the following 3 binaries (OUT_NAME)_save_graph, (OUT_NAME)_binary and (OUT_NAME)_load_graph.




(OUT_NAME)_save_graph

(OUT_NAME)_save_graph is a mex library
            (OUT_NAME)_save_graph(VDATA, ADJ, EDATA, ...
                                       OPTIONS, GRAPHFILE, STRICT)
This function will serialize the graph described by VDATA, ADJ, EDATA as well as OPTIONS.initial_schedule to the file GRAPHFILE. If STRICT is set stricter type checking will be used.
VDATA
is a cell array of vertex data where VDATA{i} is the data on vertex i. The length of this cell array is the number of vertices in the graph.
ADJ
A (sparse) adjacency matrix. If ADJ(i,j) > 0, there is an edge from vertex i to vertex j, where the data on the edge i->j is EDATA{ADJ(i,j)} The maximum dimensions of ADJ is #vertices * #vertices.
EDATA
A cell array of edge data. The ADJ adjacency matrix references into this array. It is possible for many edges to share the same data element. For instance, if I have many edges with the same value, it is possible for all their ADJ entries to point to a single entry in EDATA.
OPTIONS
A struct of only one field "initial_schedule". (leaving room for future extensions)
         OPTIONS.initial_schedule = [SCHED1, SCHED2 ...]
                 A struct array describing the initial schedule where each
                 sched entry has the following format:
SCHED.update_function
A string. The name of the update function
SCHED.vertices
array of vertex ids to update
SCHED.priority
same size as vertices. The priority for each update. Must be >0

For instance, if I have two update functions 'u1' and 'u2' and I would like u1 to update vertices 1:100 with high priority, and u2 to update vertices 101:200 with lower priority. I could fill in the OPTIONS:

OPTIONS.initial_schedule(1).update_function = 'u1'; OPTIONS.initial_schedule(1).vertices = 1:100; OPTIONS.initial_schedule(1).priority = ones(1,100); OPTIONS.initial_schedule(2).update_function = 'u2'; OPTIONS.initial_schedule(2).vertices = 101:200; OPTIONS.initial_schedule(2).priority = 0.1 * ones(1,100);

GRAPHFILE
the file to output to.
STRICT
numeric. Whether strict type checking will be used.



(OUT_NAME)_binary

(OUT_NAME)_binary is a standalone program

This program takes in the standard GraphLab command line options, (run (OUT_NAME)_binary --help for details), as well two additional options.

      --ingraphfile=[INGRAPH] --outgraphfile=[OUTGRAPH]

INGRAPH will read the GRAPHFILE generated by (OUTPUT_M_NAME)_save_graph and run GraphLab on it, using the parameters specified on the command line, as well as the initial schedule passed into the OPTIONS parameter of (OUTPUT_M_NAME)_save_graph. When complete the result will be stored in OUTGRAPH which can be read be (OUTPUT_M_NAME)_load_graph.




(OUT_NAME)_load_graph

(OUT_NAME)_load_graph is a mex library
            [VDATA, ADJ, EDATA] = (OUT_NAME)_load_graph(GRAPHFILE)

This function will deserialize the graph in GRAPHFILE and return the result in VDATA, ADJ, EDATA. See the documentation for (OUT_NAME)_load_graph for the format of VDATA, ADJ, EDATA.

OUT_NAME.m

Finally we also output a matlab script with the name OUT_NAME.m.

This is an M file generated in the working directory which automates the calling of save_graph, the binary and the load_graph functions.

The interface is:

           [NEWVDATA, NEWADJ, NEWEDATA] = OUT_NAME(VDATA, ADJ, EDATA, ...
                                                   OPTIONS, STRICT)
[VDATA, ADJ, EDATA]
graph representation. See (OUT_NAME)_save_graph above for details
OPTIONS
A struct denoting the GraphLab parameters and the initial schedule.
OPTIONS.initial_schedule
(OUT_NAME)_save_graph above for details
OPTIONS.ncpus
A positive numeric denoting #threads to start.
OPTIONS.scheduler
A string describing the scheduler and the scheduler parameters. Identical to the --scheduler=... parameter when running a GraphLab program. Run (OUT_NAME)_binary --help for details
OPTIONS.scope
A string describing the scope type. Should be either 'edge', 'vertex' or 'full'. Identical to the --scope=... parameter when running a GraphLab program.

The script first calls (OUT_NAME)_save_graph with the arguments VDATA, ADJ, EDATA, OPTIONS.initial_schedule and STRUCT. The graph file name GRAPHFILE is automatically generated. The current directory must be writeable. Next, it will invoke the program (OUT_NAME)_binary passing it the rest of the parameters in OPTIONS. Finally the output of the binary is read back using (OUT_NAME)_load_graph and returned.

Limitations

1) Type limitations: The Matlab to GraphLab compiler does not yet support

2) Pseudo-Random numbers:

EMLC does provide random number generation through the family of rand() functions. However, they are not built to be thread-safe, so use at your own risk. Later improvements to the GraphLab to Matlab compiler will provide alternate substitutes for rand() which are thread safe. Even so, many typical Matlab functions should work with minimal modifications.

3) GraphLab's Shared Variables and Sync Operation are not supported

Future versions of the Matlab to GraphLab compiler may provide this. This is low priority at the moment.

Demo

src/graphlab/matlab/demo contains a Matlab cell mode demo.