magnify
Home Overview The Abstraction

GraphLab is Graph-Parallel

Designing and implementing efficient, bug free parallel and distributed algorithms can be very challenging. To address this challenge high-level data-parallel abstractions like Map-Reduce expose a simple computational pattern that isolates users form the complexities of large-scale parallel and distribute system design. Unfortunately, many important computational tasks are not inherently data-parallel and cannot be efficiently or intuitively expressed in data-parallel abstractions.

GraphLab is a high-level graph-parallel abstraction that efficiently and intuitively expresses computational dependencies. Unlike Map-Reduce where computation is applied to independent records, computation in GraphLab is applied to dependent records which are stored as vertices in a large distributed data-graph. Computation in GraphLab is expressed as a vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices. In contrast to the more general message passing and actor models, GraphLab constrains the interaction of vertex-programs to the graph structure enabling a wide range of system optimizations. GraphLab programs interact by directly reading the state of neighboring vertices and by modifying the state of adjacent edges. In addition, vertex-programs can signal neighboring vertex-programs causing them to be rerun at some point in the future.

PageRank Example

The easiest way to explain the GraphLab abstraction is through an example.  The PageRank algorithm is a popular technique for analyzing the relative “importance” of webpages based on the hyperlink structure.  At a high-level the PageRank algorithm captures the modeling assumption that a link to a webpage is a vote of confidence and that votes from important pages are more valuable.  The PageRank of the webpage i is given by the recurrence equation:

where α is the random reset probability and N is the number of webpages.  Because the PageRank of page i depends on the PageRank of the pages that link to page i, we iteratively apply the equation until the PageRank of each page converges.  We can express this algorithm as the following vertex program:

The above vertex-program is the basis of the original GraphLab1 abstraction.  Notice that the vertex program can directly access the neighboring vertices:

Using the resulting sum the program directly updates the value of its vertex:

Finally, using one of the more powerful features of the GraphLab abstraction, the vertex-program only triggers its neighboring vertices to recompute their value when the PageRank has changed significantly.

The signal command tells the GraphLab computational engine to run the vertex-program on the neighboring vertices.

GraphLab2: Cooking with GAS

The original GraphLab1 abstraction encoded the vertex-program as a single function in the underlying language (C++/Java).  This made programming simple but limited the potential parallelism as well as the ability for the GraphLab runtime to apply some  optimizations needed to really tackle massive problems.  For example, in many real world applications a small set of vertices with have enormous neighborhoods (e.g., celebrities in a social network). These high degree vertices lead to performance bottlenecks since their vertex programs are executed sequentially on a single machine. Even worse, because the GraphLab engine cannot choose the order in which neighbors are touched or move parts of the vertex-program to the machines that store the data, certain critical optimizations are not possible.

Through our  research in graph-parallel algorithms we discovered a common pattern.  The vast majority of vertex-programs can be further decomposed into three phases: Gather, Apply, and Scatter (GAS). During the gather phase the vertex-program typically collects information about its neighborhoods.  More importantly, the the computation in the gather phase typically resembles a micro map-reduce job over the neighborhood of the vertex.  The result of the gather phase is then passed on to the apply phase in which the vertex program updates the value of the vertex.  Finally, during the scatter phase the vertex program typically signals adjacent vertices.

The GraphLab2 abstraction refines the GraphLab1 abstraction by exploiting the GAS decomposition and requiring the user to explicitly break their program into gather, apply, and scatter functions.   This allows GraphLab to parallelize and distribute the gather and scatter phases and employ sophisticated new techniques for data layout and caching.

PageRank in GraphLab2:

We can easily decompose the PageRank algorithm into the Gather, Apply, and Scatter phases:

Actual C++ Code: