Graph Datasets
License: MIT | Python 3.8+ | Format: CSV
A collection of scripts to download and convert popular graph datasets into a unified CSV format for benchmarking graph databases and algorithms.
- Unified Format: All datasets are converted to a consistent nodes.csv + edges.csv format
- 30+ Datasets: From small test graphs to billion-edge networks
- Easy to Use: Simple make commands to download and convert
- Multiple Sources: Support for MTX, OGB, Yelp, and more
- Progress Tracking: Built-in progress bars for large downloads
- Smart Caching: Skip downloads if files already exist
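The skip-if-present caching listed above can be sketched as follows. This is a minimal illustration, not the repository's download code: the helper name `fetch_if_missing` and its `downloader` parameter are hypothetical, and the only check assumed is that a non-empty file already exists at the destination.

```python
import os
import urllib.request


def fetch_if_missing(url, dest, downloader=urllib.request.urlretrieve):
    """Download url to dest unless dest already exists and is non-empty.

    Hypothetical helper for illustration; returns True if a download ran.
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return False  # cached copy found, skip the download
    tmp = dest + ".part"
    downloader(url, tmp)   # write to a temporary name first
    os.replace(tmp, dest)  # then move it into place
    return True
```

Writing to a temporary name first means an interrupted download is not mistaken for a finished file on the next run.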
    # Build all datasets
    make

    # Build one or more specific datasets directly
    make ak2010
    make ak2010 belgium_osm soc-LiveJournal1

    # Other targets work the same way
    make fetch ak2010 belgium_osm   # download only, no conversion
    make clean ak2010               # clean a specific dataset
All datasets are converted to a consistent, simple format for easy integration:
nodes.csv:

    node_id
    0
    1
    2
    ...

edges.csv:

    src,dst
    0,1
    0,2
    1,3
    ...

edges.csv with an optional weight column:

    src,dst,weight
    0,1,0.5
    0,2,1.0
    ...

- Contiguous 0-based node IDs: All node IDs are remapped to a contiguous sequence starting from 0
- UTF-8 encoded: Universal compatibility
- Header row: Column names in the first line
- Comma-delimited: Standard CSV format
- Optional properties: Extensible with additional columns (e.g., type, label)
Note: Node IDs are always remapped to a contiguous 0-based sequence [0, 1, 2, ..., N-1], regardless of the original IDs in the source dataset. This ensures consistent and efficient indexing across all datasets.
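One way to picture the remapping is the sketch below. It assigns new IDs in order of first appearance, which is an assumption made for illustration; the conversion scripts may order IDs differently. The function name `remap_node_ids` is hypothetical.

```python
def remap_node_ids(edges):
    """Map arbitrary node IDs to 0..N-1 in order of first appearance.

    edges is an iterable of (src, dst) pairs with hashable IDs.
    Returns (remapped_edges, id_map). The ordering rule is an assumption;
    the conversion scripts may assign IDs differently.
    """
    id_map = {}
    remapped = []
    for src, dst in edges:
        for node in (src, dst):
            if node not in id_map:
                id_map[node] = len(id_map)  # next unused contiguous ID
        remapped.append((id_map[src], id_map[dst]))
    return remapped, id_map


edges = [(1001, 42), (42, 7), (7, 1001)]
remapped, id_map = remap_node_ids(edges)
print(remapped)  # [(0, 1), (1, 2), (2, 0)]
```

Keeping `id_map` around makes it possible to translate results back to the original IDs of the source dataset.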
- soc-LiveJournal1 - LiveJournal social network
- soc-orkut - Orkut social network
- soc-twitter-2010 - Twitter follower network
- soc-sinaweibo - Sina Weibo social network

- cit-Patents - Patent citation network
- coAuthorsDBLP - DBLP co-authorship network

- roadNet-CA - California road network
- road_usa - USA road network
- road_central - Central USA road network
- belgium_osm - Belgium OpenStreetMap
- germany_osm - Germany OpenStreetMap
- europe_osm - Europe OpenStreetMap
- asia_osm - Asia OpenStreetMap
- osm-road-networks - Any city via OpenStreetMap (osmnx), with road attributes (lat/lon, speed, travel time, highway type, etc.)

- uk-2002 - UK web graph (2002)
- uk-2005 - UK web graph (2005)
- arabic-2005 - Arabic web graph
- indochina-2004 - Indochina web graph
- webbase-1M - WebBase crawl (1M nodes)
- webbase-2001 - WebBase crawl (2001)

- delaunay_n13 - Delaunay triangulation (2^13 nodes)
- delaunay_n21 - Delaunay triangulation (2^21 nodes)
- delaunay_n24 - Delaunay triangulation (2^24 nodes)
- kron_g500-logn21 - Kronecker graph

- ogbn-products - Amazon product co-purchase network (OGB)
- yelp - Yelp user-business review network (bipartite)
- imdb - IMDB title-person bipartite network (movies, shows, cast & crew)
- movielens-small - MovieLens small rating dataset (~100K ratings)
- movielens - MovieLens full rating dataset (~33M ratings)
- ldbc-snb - LDBC Social Network Benchmark

- hollywood-2009 - Hollywood actor collaboration network
- ak2010 - Autonomous systems graph
- geolocation - Geolocation network

    import pandas as pd

    # Load graph
    nodes = pd.read_csv('nodes.csv')
    edges = pd.read_csv('edges.csv')
    print(f"Nodes: {len(nodes)}, Edges: {len(edges)}")

    # For typed nodes (e.g., Yelp)
    if 'type' in nodes.columns:
        print(nodes['type'].value_counts())

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Edge { int src, dst; };

    // Read the src,dst columns of an edges.csv file; extra columns are ignored.
    std::vector<Edge> read_edges(const std::string& filename) {
        std::vector<Edge> edges;
        std::ifstream file(filename);
        std::string line;
        std::getline(file, line);  // Skip header
        while (std::getline(file, line)) {
            std::istringstream iss(line);
            Edge e;
            char comma;
            if (iss >> e.src >> comma >> e.dst) {  // Skip malformed or blank lines
                edges.push_back(e);
            }
        }
        return edges;
    }

- mtx2csv.py - Convert Matrix Market (.mtx) to CSV
- ogbn-products/ - OGB dataset converter
- yelp/ - Yelp dataset converter
- osm-road-networks/ - OSM road network downloader (osmnx)
- preview_graph.py - Preview graph statistics
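To give a feel for what a Matrix Market conversion involves, here is a minimal sketch. It is not the repository's mtx2csv.py: the function name `mtx_to_edges_csv` is hypothetical, and the sketch assumes a square matrix in the `coordinate` layout with either no values (`pattern`) or one real or integer value per entry. It does not expand symmetric matrices into both edge directions.

```python
import csv


def mtx_to_edges_csv(mtx_path, edges_path):
    """Write a Matrix Market coordinate file as src,dst[,weight] CSV.

    Illustrative only: handles the 'coordinate' layout, shifts the 1-based
    MTX indices to 0-based, and does not expand symmetric matrices.
    Returns the number of edges written.
    """
    with open(mtx_path, encoding="utf-8") as src, \
            open(edges_path, "w", newline="", encoding="utf-8") as dst:
        banner = src.readline().lower().split()
        if len(banner) < 4 or banner[0] != "%%matrixmarket" or banner[2] != "coordinate":
            raise ValueError("expected a MatrixMarket coordinate file")
        weighted = banner[3] != "pattern"  # 'pattern' files carry no values
        writer = csv.writer(dst)
        writer.writerow(["src", "dst", "weight"] if weighted else ["src", "dst"])

        size_seen = False
        count = 0
        for line in src:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # comment or blank line
            if not size_seen:
                size_seen = True  # first data line is "rows cols nnz"
                continue
            parts = line.split()
            row = [int(parts[0]) - 1, int(parts[1]) - 1]  # 1-based to 0-based
            if weighted:
                row.append(parts[2])
            writer.writerow(row)
            count += 1
    return count
```

Because Matrix Market indices already run from 1 to N, subtracting one is enough to obtain contiguous 0-based IDs for a square adjacency matrix.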
Download any city's road network using osmnx:
    # Default city (Pasadena, CA)
    make osm-road-networks

    # Custom city
    make -C osm-road-networks PLACE="Beijing, China"
    make -C osm-road-networks PLACE="Tokyo, Japan"

Output per city (in a subdirectory named after the place):
    osm-road-networks/
      pasadena_california_usa/
        nodes.csv   # node_id, lat, lon
        edges.csv   # src, dst, length, speed_kph, travel_time, name, highway, oneway, maxspeed, lanes
      beijing_china/
        nodes.csv
        edges.csv

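As an example of using the road attributes, the sketch below finds the fastest travel time between two nodes with Dijkstra's algorithm. It assumes every row has a numeric `travel_time` value and that each row is a directed edge from `src` to `dst`; the function names are hypothetical and the code uses only the standard library.

```python
import csv
import heapq
from collections import defaultdict


def load_travel_graph(edges_path):
    """Build {src: [(dst, travel_time), ...]} from an edges.csv file."""
    graph = defaultdict(list)
    with open(edges_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            graph[int(row["src"])].append((int(row["dst"]), float(row["travel_time"])))
    return graph


def fastest_time(graph, source, target):
    """Dijkstra's algorithm; returns total travel time or None if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        time, node = heapq.heappop(queue)
        if node == target:
            return time
        if time > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(node, ()):
            candidate = time + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None
```

A call such as `fastest_time(load_travel_graph("edges.csv"), 0, 25)` returns the summed `travel_time` along the quickest route, in whatever unit the column uses.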
Some datasets include additional node/edge properties:
Yelp (bipartite graph):
    node_id,type,stars,review_count
    0,business,4.0,12
    150346,user,3.72,15
    ...

Extensible format - add columns as needed:
    src,dst,weight,timestamp,label
    0,1,0.5,1609459200,friend

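A reader that tolerates any set of extra columns could look like the sketch below. It relies only on the `src` and `dst` header names shown above; the function name `read_edges` is hypothetical, and optional columns are returned as untyped strings because their meaning varies by dataset.

```python
import csv


def read_edges(path):
    """Yield (src, dst, extras) where extras holds any optional columns."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            src = int(row.pop("src"))
            dst = int(row.pop("dst"))
            yield src, dst, row  # remaining columns stay as strings
```

Code written this way keeps working when a dataset adds a column such as `timestamp`, since unknown columns are simply passed through.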
- MTX (Matrix Market)
- OGB (Open Graph Benchmark)
- Yelp JSON
- SNAP format
- Custom formats
Contributions welcome! To add a new dataset:
- Create a subdirectory named after the dataset
- Add a Makefile with download/conversion rules
- Ensure the output follows the unified CSV format
- Update this README
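Before submitting a new dataset, a quick check along these lines can catch format mistakes. This is a sketch, not a tool shipped with the repository: the function name `validate_graph` is hypothetical, it assumes `node_id` is the first column of nodes.csv and `src,dst` are the first two columns of edges.csv, and it loads all node IDs into memory, so it suits small and medium graphs.

```python
import csv


def validate_graph(nodes_path, edges_path):
    """Return a list of problems found; an empty list means the files pass.

    Checks only the rules listed in this README: header names,
    contiguous 0-based node IDs, and edge endpoints within range.
    """
    problems = []
    with open(nodes_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if not reader.fieldnames or reader.fieldnames[0] != "node_id":
            return ["nodes.csv: first column must be node_id"]
        ids = sorted(int(row["node_id"]) for row in reader)
    if ids != list(range(len(ids))):
        problems.append("nodes.csv: node IDs are not contiguous from 0")

    with open(edges_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if not reader.fieldnames or reader.fieldnames[:2] != ["src", "dst"]:
            return problems + ["edges.csv: first two columns must be src,dst"]
        for line_no, row in enumerate(reader, start=2):
            for col in ("src", "dst"):
                if not 0 <= int(row[col]) < len(ids):
                    problems.append(f"edges.csv line {line_no}: {col} out of range")
    return problems
```

Running `validate_graph("nodes.csv", "edges.csv")` inside the new dataset directory should return an empty list.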
MIT License - see individual dataset sources for their respective licenses.
Note: Dataset sizes range from thousands to billions of edges. Check individual dataset directories for specific statistics and download requirements.