The goal is to simplify the process of GraphLab rapid propotyping by allow the user to write update functions in Matlab while still providing good parallel performance. The problem however, is that Matlab is not thread-safe, and providing a GraphLab->Matlab interface by simply using Mex will not offer any performance gains through the use of parallelism.
To solve this problem, we provide the Matlab to GraphLab compiler. The Matlab to GraphLab compiler is a toolkit which uses Matlab 2010b's EMLC compiler to translate update functions written in Matlab to native C code. The C code is then compiled into a GraphLab program. Additional interface functions are also provided which allow the user to interact with the GraphLab program from Matlab. The only caveat is that update functions must be written in Embedded Matlab: a subset of the full Matlab language.
Unfortunately not all of GraphLab is fully supported yet. We will continue making improvements available as we implement them.
Embedded Matlab is a subset of the full Matlab language. In particular, it does not support
Only a (relatively large) subset of Matlab functions are supported.
In addition, the Matlab to GraphLab Compiler does not yet support
Embedded Matlab can be thought of as a "strongly typed" version of Matlab. The type of the variables cannot depend on the program flow.
For instance: the following code is not allowed since the type of x depends on the value of c:
% Source: http://www.mathworks.com/help/toolbox/eml/ug/bq2l74g.html if(c>0) x = int8(0); else x = [1 2 3]; end disp(x);
However, the following is allowed:
x = int8(0); x = [1 2 3];
In this case: the EMLC compiler will rename the second instance of x. See here for more details on the use of variables in Embedded Matlab.
The data graph is defined over two data types. A vertex data type, and an edge data type. In MatLab, this means that all vertices must be of the same data type, and all edges must be of a same (possibly different) data type.
For instance, if one vertex data is a double matrix, all vertex data must be double matrices. And if one edge data has a particular struct layout, All edge data must follow the identical struct layout.
For instance: if I were to define a pairwise Markov Random Field for Belief Propagation, I could define my vertex and edge data to be:
vdata.belief : 1 x N matrix vdata.unary : 1 x N matrix edata.msg : 1 x N matrix edata.binary: N x N matrix
For users familar with EMLC, you might know that EMLC defines arrays of different sizes to be different types. To simplify the use of the Matlab to GraphLab, the compiler automatically forces all array types to be variable sizes. Therefore the effective data type is:
vdata.belief : 1 x :? matrix vdata.unary : 1 x :? matrix edata.msg : 1 x :? matrix edata.binary: :? x :? matrix
Where each ":?" can be a different value for each vertex/edge, and can change at runtime.
The canonical representation we will use for the graph representation is the triplet (VDATA, ADJ, EDATA).
An update function must satisfy the following interface:
function bp_update(currentvertex, inedges, inv, outedges, outv, handle) %#eml
The %#eml is to denote that this function is an Embedded Matlab function. It is optional. Omitting it will result in a compiler warning.
A useful property to note is that both inv and outv are sorted arrays. Therefore, if the entire graph is bi-directed, (i.e. inv == outv), then necessarily inedges[i] is the corresponding reverse edge for outedges[i].
In addition: the following functions are provided which allow the update functions access to the graph data and the scheduler. We call these functions "link" functions. (they allow the EML code to call GraphLab code) The handle parameter must be passed unmodified to these functions:
The following four functions provide access to vertex data and edge data.
function vdata = get_vertex_data(handle, vertex) % returns data on vertex function edata = get_edge_data(handle, edge) % returns data on edge function set_vertex_data(handle, vertex, vdata) % sets data on vertex function set_edge_data(handle, edge, edata) % sets data on edge
In addition, we provde an 'add_task' function which allows for dynamic scheduling.
function add_task(handle, vertex, 'function name', priority)
Whenever the vertex data type / edge data type / or the update function changes, it is necessary to recompile the code.
The function compile_update_function in the graphlab/matlab directory is used to compile the code. (Do not move the compile_update_function.m file from that location! It uses its location to identify the directory for the GraphLab header files). help compile_update_function provides details on arguments and the compilation process.
compile_update_function is an m file which compiles a collection of update functions into a graphlab program.
compile_update_function(UPDATE_FNS, EX_VERTEX, EX_EDGE,
GLLIB_DIR, WORKING_DIR, OUT_NAME,
OPT_LEVEL = 0, PAR_COMPILES = 4,
NO_TCMALLOC = 0);
This function is the wrapper around the Matlab -> GraphLab compilation
process. It will use emlc to compile a collection of update function
names as listed as strings in UPDATE_FNS into binary executables in
the WORKING_DIR.
The script will poroduce a Makefile in the WORKING_DIR which will compile the following 3 binaries (OUT_NAME)_save_graph, (OUT_NAME)_binary and (OUT_NAME)_load_graph.
(OUT_NAME)_save_graph(VDATA, ADJ, EDATA, ...
OPTIONS, GRAPHFILE, STRICT)
This function will serialize the graph described by VDATA, ADJ, EDATA
as well as OPTIONS.initial_schedule to the file GRAPHFILE. If STRICT
is set stricter type checking will be used.
OPTIONS.initial_schedule = [SCHED1, SCHED2 ...]
A struct array describing the initial schedule where each
sched entry has the following format:
For instance, if I have two update functions 'u1' and 'u2' and I would like u1 to update vertices 1:100 with high priority, and u2 to update vertices 101:200 with lower priority. I could fill in the OPTIONS:
OPTIONS.initial_schedule(1).update_function = 'u1'; OPTIONS.initial_schedule(1).vertices = 1:100; OPTIONS.initial_schedule(1).priority = ones(1,100); OPTIONS.initial_schedule(2).update_function = 'u2'; OPTIONS.initial_schedule(2).vertices = 101:200; OPTIONS.initial_schedule(2).priority = 0.1 * ones(1,100);
This program takes in the standard GraphLab command line options, (run (OUT_NAME)_binary --help for details), as well two additional options.
--ingraphfile=[INGRAPH] --outgraphfile=[OUTGRAPH]
INGRAPH will read the GRAPHFILE generated by (OUTPUT_M_NAME)_save_graph and run GraphLab on it, using the parameters specified on the command line, as well as the initial schedule passed into the OPTIONS parameter of (OUTPUT_M_NAME)_save_graph. When complete the result will be stored in OUTGRAPH which can be read be (OUTPUT_M_NAME)_load_graph.
[VDATA, ADJ, EDATA] = (OUT_NAME)_load_graph(GRAPHFILE)
This function will deserialize the graph in GRAPHFILE and return the result in VDATA, ADJ, EDATA. See the documentation for (OUT_NAME)_load_graph for the format of VDATA, ADJ, EDATA.
Finally we also output a matlab script with the name OUT_NAME.m.
This is an M file generated in the working directory which automates the calling of save_graph, the binary and the load_graph functions.
The interface is:
[NEWVDATA, NEWADJ, NEWEDATA] = OUT_NAME(VDATA, ADJ, EDATA, ...
OPTIONS, STRICT)
The script first calls (OUT_NAME)_save_graph with the arguments VDATA, ADJ, EDATA, OPTIONS.initial_schedule and STRUCT. The graph file name GRAPHFILE is automatically generated. The current directory must be writeable. Next, it will invoke the program (OUT_NAME)_binary passing it the rest of the parameters in OPTIONS. Finally the output of the binary is read back using (OUT_NAME)_load_graph and returned.
1) Type limitations: The Matlab to GraphLab compiler does not yet support
2) Pseudo-Random numbers:
EMLC does provide random number generation through the family of rand() functions. However, they are not built to be thread-safe, so use at your own risk. Later improvements to the GraphLab to Matlab compiler will provide alternate substitutes for rand() which are thread safe. Even so, many typical Matlab functions should work with minimal modifications.
3) GraphLab's Shared Variables and Sync Operation are not supported
Future versions of the Matlab to GraphLab compiler may provide this. This is low priority at the moment.