Guides
Task-oriented walkthroughs for the things people actually do with Common Crawl.
Each guide is built around a job rather than a command: finding pages, fetching their content, working with whole archives, querying the columnar index, building a local dataset, looking up ranks, and scanning the news feed. They assume you have run the quick start.
Finding pages
Query the URL index for captures of a URL or a path pattern, and filter the results.
Fetching content
Fetch exactly what Common Crawl captured for a URL, rendered as text, Markdown, or links, or returned as the raw HTTP response.
Bulk and archives
List, download, parse, and convert whole WARC, WAT, and WET files for a crawl.
The columnar index
Answer dataset-wide questions over the Parquet copy of the URL index with DuckDB or Athena.
Building a dataset
Load a slice of Common Crawl into a local DuckDB database and query it offline.
Host and domain ranks
Look up harmonic-centrality and PageRank positions from the Common Crawl web graph.
Scanning the news
Work with the continuous CC-NEWS dataset, which has no URL index.
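
As a rough orientation for the first two guides (finding pages, then fetching their content), the sketch below shows the two requests those steps amount to when made directly against Common Crawl's public endpoints. It is a minimal illustration, not the method the guides themselves use: the function names `index_query_url` and `record_request` are hypothetical, and the crawl ID `CC-MAIN-2024-33` is only an example. It assumes the index server returns one JSON object per capture with `filename`, `offset`, and `length` fields.

```python
from urllib.parse import urlencode

# Public Common Crawl endpoints (URL index server and data host).
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id: str, url_pattern: str, limit: int = 5) -> str:
    """Build a URL-index query for captures matching url_pattern.

    Hypothetical helper: it only assembles the query string, it does not
    send the request.
    """
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def record_request(capture: dict) -> tuple[str, dict]:
    """Turn one index result into the URL and Range header for its WARC record.

    Assumes the capture dict carries 'filename', 'offset', and 'length',
    with the numeric values given as strings.
    """
    start = int(capture["offset"])
    end = start + int(capture["length"]) - 1  # HTTP byte ranges are inclusive
    url = f"{DATA_HOST}/{capture['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}


if __name__ == "__main__":
    # Example crawl ID and pattern; substitute your own.
    print(index_query_url("CC-MAIN-2024-33", "example.com/*"))
```

Sending the first URL with any HTTP client yields the capture records; passing one of them to `record_request` gives the ranged request that returns that capture's gzip-compressed WARC record and nothing else from the archive file.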