Guides

Task-oriented walkthroughs for the things people actually do with Common Crawl.

Each guide is built around a job rather than a command: finding pages, fetching their content, working with whole archives, querying the columnar index, building a local dataset, looking up ranks, and scanning the news feed. They assume you have run the quick start.

Finding pages
Query the URL index for captures of a URL or a path pattern, and filter the results.

Fetching content
Pull the exact bytes Common Crawl captured for a URL, as text, Markdown, links, or the raw HTTP response.

Bulk and archives
List, download, parse, and convert whole WARC, WAT, and WET files for a crawl.

The columnar index
Answer dataset-wide questions over the Parquet copy of the URL index with DuckDB or Athena.

Building a dataset
Load a slice of Common Crawl into a local DuckDB database and query it offline.

Host and domain ranks
Look up harmonic-centrality and PageRank positions from the Common Crawl web graph.

Scanning the news
Work with the continuous CC-NEWS dataset, which has no URL index.