Datasets

To help encourage the exploration of graph data, we have assembled a collection of public domain graph datasets.

| Dataset name | Dataset size | GraphLab algorithm | Download instructions | Credit |
|---|---|---|---|---|
| Yahoo! KDD CUP 2011 – music rating | 1M users, 600K songs, 260M ratings | Matrix factorization | Instructions | Yahoo! KDD CUP |
| Twitter social graph | 8K x 8K Twitter users, 62M links | Matrix factorization with sparse factors | Instructions | Timmy Wilson, smarttypes.org |
| Netflix – collaborative filtering (subset) | 1M x 17K, 3M nnz | Alternating least squares | Due to copyright, the Netflix data is not available for download. Instead, we provide a small synthetic sample with Netflix-like properties, with running instructions here. | Netflix |
| NPIC 500 Dataset (natural language processing dataset) | 88K noun phrases, 99K contexts, 20M occurrences | SVD | See the steps below | Tom Mitchell, CMU |

Download instructions for the NPIC 500 dataset:

1. Download dataset from here

2. Extract the tgz file using: tar xvzf all-pairs-t500-matrix-data-code.tar.gz

3. Find the file matrix.txt, and add the following two lines at the top:

%%MatrixMarket matrix coordinate real general

88322 99400 20597287

4. [GraphLab version 2] Run SVD using: ./pmf matrix.txt 13 --ncpus=8 --matrixmarket=true --max_iter=10
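
Step 3 above (adding the two MatrixMarket lines to the top of matrix.txt) is awkward to do by hand on a file with about 20M entries. A minimal Python sketch of one way to script it follows. The function name and its parameters are illustrative and not part of GraphLab; the banner and the default size values are the ones quoted in step 3.

```python
import os
import shutil
import tempfile

# Banner quoted from step 3 of the NPIC instructions.
BANNER = "%%MatrixMarket matrix coordinate real general"


def prepend_matrix_market_header(path, rows=88322, cols=99400, nnz=20597287):
    """Rewrite `path` so it starts with a MatrixMarket banner and size line.

    Hypothetical helper: the defaults are the NPIC 500 dimensions from step 3.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory and swap it in at the
    # end, so a failure part-way through does not leave a truncated file.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(f"{BANNER}\n{rows} {cols} {nnz}\n")
        with open(path, "r") as src:
            # Stream the original entries in chunks instead of loading them all.
            shutil.copyfileobj(src, tmp)
    os.replace(tmp.name, path)


# Example call (assumes matrix.txt is in the current directory):
# prepend_matrix_market_header("matrix.txt")
```

Running the helper twice would add the header twice, so it should be applied only to the freshly extracted file.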

Bigger Datasets (above half a billion non-zeros)

| Dataset name | Dataset size | GraphLab algorithm | Download instructions | Credit |
|---|---|---|---|---|
| Wikipedia term occurrences dataset | 4.3M terms, 3.3M documents, 513M occurrences | SVD | Download the file medwiki.gz | Contributed by Andrew Onley, The University of Memphis. |
| Wikipedia term occurrences dataset | 40K terms, 10M documents, 689M occurrences | SVD | Download the file bigwiki.gz | Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU. |
| Wikipedia term occurrences dataset | 40K terms, 50M documents, 3.3B occurrences | SVD | Download the file hugewiki.gz | Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU. |
| Mouse Visual Cortex | 26K x 21K image (572M non-zeros) | Spectral clustering | Original data here. Download matrix market format file: mouse_brain from here. | Contributed by Joshua Vogelstein, OpenConnectToMe Project, Johns Hopkins University. |
| Twitter graph | 41M nodes, 1.4 billion edges | K-cores | Download instructions | Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon |
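
With files of this size, it can help to check the stated dimensions before starting a long run. The sketch below reads only the header of a gzip-compressed file and returns its row, column, and non-zero counts. It assumes the file is in MatrixMarket coordinate format, which this page states only for mouse_brain; whether the .gz Wikipedia files use the same layout is an assumption. The function name is illustrative.

```python
import gzip


def read_matrix_market_size(path):
    """Return (rows, cols, nnz) from a gzip-compressed MatrixMarket file.

    Hypothetical helper: assumes a '%%MatrixMarket' banner on the first line,
    optional '%' comment lines, then a 'rows cols nnz' size line.
    """
    with gzip.open(path, "rt") as f:
        banner = f.readline()
        if not banner.startswith("%%MatrixMarket"):
            raise ValueError("missing %%MatrixMarket banner")
        for line in f:
            # Skip comment and blank lines; the next line is the size line.
            if line.startswith("%") or not line.strip():
                continue
            rows, cols, nnz = (int(tok) for tok in line.split()[:3])
            return rows, cols, nnz
    raise ValueError("no size line found")
```

Because gzip streams are decompressed incrementally, this reads only the first few lines and does not unpack the whole file.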

One Response

  1. Anonymous Coward

    And they would be located where…?