Datasets
To encourage the exploration of graph data, we have assembled a collection of public domain graph datasets.
| Dataset name | Dataset size | GraphLab Algorithm | Download instructions | Credit |
|---|---|---|---|---|
| Yahoo! KDD CUP 2011 – music rating | 1M users, 600K songs, 260M ratings | Matrix factorization | Instructions | Yahoo! KDD CUP |
| Twitter social graph | 8K x 8K Twitter users, 62M links | Matrix factorization with sparse factors | Instructions | Timmy Wilson, smarttypes.org |
| Netflix – collaborative filtering (subset) | 1M x 17K, 3M nnz | Alternating least squares | Due to copyright, Netflix data is not available for download. Instead, we provide a small synthetic sample with Netflix-like properties, with running instructions here. | Netflix |
| NPIC 500 dataset (natural language processing) | 88K noun phrases, 99K contexts, 20M occurrences | SVD | 1. Download the dataset from here. 2. Extract the tgz file using: `tar xvzf all-pairs-t500-matrix-data-code.tar.gz` 3. Find the file matrix.txt and add the following two lines at the top: `%%MatrixMarket matrix coordinate real general` and `88322 99400 20597287` 4. [GraphLab version 2] Run SVD using: `./pmf matrix.txt 13 --ncpus=8 --matrixmarket=true --max_iter=10` | Tom Mitchell, CMU |
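
Step 3 of the NPIC 500 instructions (adding the two Matrix Market header lines to matrix.txt) is awkward to do by hand on a file with roughly 20M entries. The following is a minimal sketch of one way to do it in Python. The function name `prepend_matrix_market_header` and the temporary-file approach are illustrative assumptions, not part of GraphLab; only the two header lines come from the instructions above.

```python
import os
import shutil
import tempfile

# Header values copied from step 3 of the NPIC 500 instructions.
BANNER = "%%MatrixMarket matrix coordinate real general"
SIZE_LINE = "88322 99400 20597287"  # rows, columns, non-zero entries


def prepend_matrix_market_header(path, banner=BANNER, size_line=SIZE_LINE):
    """Insert the two header lines at the top of `path`.

    Hypothetical helper: returns True if the header was added, and False
    if the file already starts with a Matrix Market banner.
    """
    with open(path, "r", encoding="utf-8") as src:
        first_line = src.readline()
        if first_line.startswith("%%MatrixMarket"):
            return False  # header already present; leave the file untouched
        # Stream into a temporary file in the same directory so the large
        # matrix is never loaded into memory all at once.
        directory = os.path.dirname(os.path.abspath(path))
        with tempfile.NamedTemporaryFile(
            "w", encoding="utf-8", dir=directory, delete=False
        ) as dst:
            dst.write(banner + "\n")
            dst.write(size_line + "\n")
            dst.write(first_line)
            shutil.copyfileobj(src, dst)
            tmp_name = dst.name
    # Swap the new file into place once both handles are closed.
    os.replace(tmp_name, path)
    return True
```

Calling `prepend_matrix_market_header("matrix.txt")` in the extracted directory would rewrite the file in place. Running it a second time leaves the file unchanged, which guards against adding the header twice.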
Bigger Datasets (above half a billion non-zeros)
| Dataset name | Dataset size | GraphLab Algorithm | Download instructions | Credit |
|---|---|---|---|---|
| Wikipedia term occurrences dataset | 4.3M terms, 3.3M documents, 513M occurrences | SVD | Download the file medwiki.gz | Contributed by Andrew Olney, The University of Memphis. |
| Wikipedia term occurrences dataset | 40K terms, 10M documents, 689M occurrences | SVD | Download the file bigwiki.gz | Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU. |
| Wikipedia term occurrences dataset | 40K terms, 50M documents, 3.3G occurrences | SVD | Download the file hugewiki.gz | Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU. |
| Mouse Visual Cortex | 26K x 21K image (572M non-zeros) | Spectral Clustering | Original data here. Download the Matrix Market format file mouse_brain from here. | Contributed by Joshua Vogelstein, OpenConnectToMe Project, Johns Hopkins University. |
| Twitter graph | 41M nodes, 1.4 billion edges | K-cores | Download instructions | Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon |
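
With files this large, it can help to confirm the dimensions before starting a long run. The sketch below reads only the size line of a gzipped file. It assumes the download is gzip-compressed Matrix Market coordinate text with a banner line, which the table does not state for every file; the function name `read_matrix_market_size` is hypothetical.

```python
import gzip


def read_matrix_market_size(path):
    """Return (rows, cols, nnz) from the size line of a gzipped file.

    Assumes Matrix Market coordinate text: an optional banner and comment
    lines starting with '%', followed by a 'rows cols nnz' size line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # Skip blank lines, the banner, and comments.
            if not stripped or stripped.startswith("%"):
                continue
            rows, cols, nnz = (int(field) for field in stripped.split()[:3])
            return rows, cols, nnz
    raise ValueError("no size line found in " + path)
```

Because the function stops after the first non-comment line, it decompresses only the start of the file. If a file lacks the header, the first data line would be misread as the size line, so the result should be checked against the sizes listed in the table.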









