CLI

Every command and subcommand, with the flags that matter.

ccrawl <command> [subcommand] [flags]

Run ccrawl <command> --help for the full flag list on any command. This page is the map.

Commands

Command	What it does
`crawls`	List, resolve, and inspect the monthly crawls
`search`	Query the URL index (CDX) for captures of a URL or pattern
`get`	Fetch what Common Crawl captured for a URL (the curl for Common Crawl)
`fetch`	Retrieve WARC records by explicit location, or from stdin
`download`	Download whole archive files for a crawl
`paths`	List the archive file paths for a crawl
`parse`	Decode a local WARC/WAT/WET file into records
`extract`	Pull text, links, title, or Markdown from a captured page
`news`	Work with the continuous CC-NEWS dataset
`table`	Query the columnar Parquet index (alias `columnar`, `athena`)
`rank`	Look up host and domain ranks from the web graph
`db`	Build and query a local DuckDB database
`convert`	Convert WARC/WAT/WET archives to Parquet or JSONL
`stats`	Show the shape of a crawl: file counts per archive kind
`config`	Show resolved configuration and data paths
`cache`	Inspect and clear the on-disk cache
`version`	Print the version and exit

crawls

Subcommand	Does
`crawls list`	List the monthly crawls, newest first
`crawls latest`	Print the newest crawl ID
`crawls resolve <ref>`	Resolve a year or `latest` to a crawl ID
`crawls info <id>`	File counts per archive kind for a crawl

search

ccrawl search <url|pattern> [flags]

A trailing /* matches everything under a path. Filters: --mime, --status. Shaping: --fields, --template, -o, -n. Aliases: index, cdx.

get

ccrawl get <url> [flags]

Content flags (pick one): --text, --markdown, --links, --headers. With none, prints the raw HTTP response body.

fetch

ccrawl fetch [-] [flags]

Locate a record with --file, --offset, --length, or stream JSONL locations on stdin with -. Content flags: --body (default), --text, --markdown, --links, --headers, --meta. Write one file per record with --dir and --out-dir.

download

ccrawl download <kind|-> [flags]

Kinds: warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table. Use - to read paths on stdin. --out sets the directory, --flat drops the source tree, -j/--workers sets concurrency. --library files the archives under <library>/<crawl>/<kind>/ instead of the data dir (see bulk and archives).

paths

ccrawl paths <kind> [flags]

Kinds: warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table, segment. --kinds lists them. -o url prints full URLs. --library lists the files of a kind already in the library instead of the remote manifest.

parse

ccrawl parse <file|-> [flags]

Force the format with --format (warc|wat|wet). Filters: --type, --status, --mime, --lang, --url. Content: --links, --text, --markdown, --meta. With --library the argument is a kind and parse decodes every local archive of that kind through one output.

extract

Subcommand	Does
`extract title <url>`	The page title
`extract text <url>`	Readable plain text
`extract markdown <url>`	HTML converted to Markdown
`extract links <url>`	Outbound links

news

Subcommand	Does
`news list`	List CC-NEWS files for `--year`/`--month`
`news download`	Download CC-NEWS files
`news search <host>`	Stream and match a host (no index)

table

Aliases: columnar, athena.

Subcommand	Does
`table urls`	Matching URLs
`table locations`	Record locations, ready for `fetch`
`table count`	Count of matching captures
`table langs`	Breakdown by content language
`table mimes`	Breakdown by MIME type
`table sql`	Build the SQL from the filter flags and print it
`table query <sql>`	Run raw SQL (`ccindex` is the source)
`table schema`	The columns of the index

Filters across the subcommands: --domain, --host, --tld, --mime, --status, --lang, --path-prefix, --subset. Engine: --engine (auto|duckdb|print), --print.

rank

Subcommand	Does
`rank domain <domain>`	Rank of a registered domain
`rank host <host>`	Rank of a host
`rank top`	Top-ranked hosts or domains

Always pass --table <url> (the gzipped rank table). top takes --tld.

db

Subcommand	Does
`db load`	Load matching index records into local DuckDB
`db sql <query>`	Run SQL against the local database
`db shell`	Open an interactive DuckDB shell
`db path`	Print the database file path

db load takes the same filter flags as table.

convert

ccrawl convert <file|dir> [flags]

--to parquet|jsonl (default parquet). -O/--out sets the output file or directory. --markdown converts HTML bodies on the way. With --library the argument is a kind: convert reads every local archive of that kind and writes the output under <crawl>/<format>/<kind>/.

Command	Does
`stats`	File counts per archive kind for a crawl
`config show`	Resolved configuration and data paths
`cache info`	Cache size and entry count
`cache dir`	Print the cache directory
`cache clear`	Remove every cached entry
`version`	Print the version and exit

Global flags

These apply to most commands. See configuration for the full list and their defaults.

Flag	Meaning
`-c, --crawl`	Crawl ID, year, or `latest`/`all` (default `latest`)
`-o, --output`	Output format (default auto)
`-n, --limit`	Maximum results (`0` means unlimited)
`-j, --workers`	Concurrency for downloads and scans
`--source`	Bulk data source: `https` or `s3`
`--rate`	Minimum delay between requests
`--no-cache`	Bypass the on-disk cache
`--library`	Read and write under the dataset library
`--library-dir`	Library root (default `~/notes/ccrawl`)
`--data-dir`	Root data directory