Bulk and archives
List, download, parse, and convert whole WARC, WAT, and WET files for a crawl.
Single-record fetching covers most questions. When you want volume, you work with whole archive files: list their paths, download them, parse them locally, and convert them into something your pipeline can read.
Listing paths
ccrawl paths <kind> prints the file paths for a crawl, one per line, ready to
pipe into download. The kinds are warc, wat, wet, robotstxt,
non200responses, cc-index, cc-index-table, and segment:
ccrawl paths warc -c 2024-51 # every WARC path for the crawl
ccrawl paths wet -n 1 # the first WET path
ccrawl paths robotstxt -o url # full https URLs, not relative paths
Downloading files
ccrawl download <kind> pulls whole files from the crawl manifest,
concurrently and resumably. Pass - instead of a kind to read an explicit list
of paths on stdin:
ccrawl download warc -n 5 # first 5 WARC files of the crawl
ccrawl download robotstxt -n 1 # one robots.txt archive
ccrawl paths wet -n 100 | ccrawl download - # download a chosen list
Files land under <data-dir>/raw in the source directory tree by default; use
--out to choose a directory and --flat to drop the tree. Concurrency is set
by --workers (-j).
Parsing a local archive
ccrawl parse <file> decodes a WARC, WAT, or WET file into structured records.
The format is detected from the file name and can be forced with --format:
ccrawl parse file.warc.gz -o jsonl
ccrawl parse file.warc.gz --type response --status 200 -o table -n 20
ccrawl parse file.warc.wet.gz --lang eng -o jsonl
The same content flags as get apply, so you can extract while you parse:
ccrawl parse file.warc.gz --links -o url # every outbound link in the file
ccrawl parse file.warc.gz --text -o jsonl # text of every response
ccrawl parse file.warc.gz --markdown -o jsonl # Markdown of every response
parse reads stdin with -, so it pairs with download for a stream that
never hits disk:
ccrawl paths wet -n 1 | ccrawl download - --out - | ccrawl parse - --lang eng
Converting to Parquet or JSONL
ccrawl convert turns archive files into columnar Parquet or line-delimited
JSON for your own tooling. It defaults to Parquet:
ccrawl convert file.warc.gz --to parquet -O out.parquet
ccrawl convert file.warc.wet.gz --to jsonl
ccrawl convert ./warc/ --to parquet --out ./parquet # a whole directory
Add --markdown to convert HTML response bodies to Markdown as they are
written. Parquet output makes a crawl slice queryable in DuckDB, Athena, or
Spark without any further plumbing.
Building a dataset library
For volume work you want the files in one tidy, browsable place rather than
scattered across ad-hoc --out dirs. The --library flag gives the archive
files a home and extends the same four commands to list, download, and process
them in place. The library lives under ~/notes/ccrawl by default; override it
with --library-dir or CCRAWL_LIBRARY. It is a separate tree from the data
dir, so clearing scratch state never touches the corpus you keep.
ccrawl download wet -n 20 --library -c 2024-51 # pull 20 WET files into the library
ccrawl paths wet --library -c 2024-51 # list the WET files you have locally
ccrawl parse wet --library --lang eng -o jsonl # decode every local WET file, eng only
ccrawl convert wet --library --to parquet # write parquet beside the raw files
In library mode the argument is a kind, not a file. Raw archives land under
<crawl>/<kind>/ and processed output under <crawl>/<format>/<kind>/, so a
directory listing is the index:
~/notes/ccrawl/CC-MAIN-2024-51/
wet/ raw WET archives
parquet/wet/ the same files as Parquet
jsonl/wet/ and as JSONL
A few things worth knowing:
downloadskips files already present, so re-running only fetches what is missing and the corpus grows incrementally.paths --librarylists what you have on disk for a kind, the local mirror of the remote manifest.parse --librarystreams every archive of a kind through one output, so a global-ncaps the whole run rather than each file.convert --librarypicks the destination subtree from--to(parquet/jsonl); pass-O/--outto send it elsewhere.
ccrawl config show prints the active library_dir so you can confirm where a
run reads and writes.
Counting files
ccrawl stats shows the shape of a crawl: how many files of each kind it ships.
ccrawl stats -c 2024-51
For row-level questions over a whole crawl, reach for the columnar index rather than downloading and parsing everything.