ccrawl

Configuration

The data directory, environment variables, and global flags, with their defaults.

ccrawl needs almost no configuration. There is no config file; every option is a flag or an environment variable, and the defaults are chosen so the common case needs neither.

The data directory

ccrawl keeps all of its state under one tree, ~/data/ccrawl by default: the on-disk cache, downloaded archives, converted Parquet, and the local DuckDB file. To print the resolved paths at any time, run:

ccrawl config show
data_dir     ~/data/ccrawl
cache_dir    ~/data/ccrawl/cache
raw_dir      ~/data/ccrawl/raw
parquet_dir  ~/data/ccrawl/parquet
db_path      ~/data/ccrawl/ccrawl.duckdb

Point the whole tree somewhere else with CCRAWL_DATA_DIR, or per-command with --data-dir.

The dataset library

The --library flag (see bulk and archives) reads and writes a curated corpus of archive files in a tree of its own, separate from the data dir so scratch state and the files you keep never mix. It defaults to ~/notes/ccrawl and reports as library_dir in ccrawl config show:

library_dir  ~/notes/ccrawl

Move it with CCRAWL_LIBRARY or per-command with --library-dir. Inside it, raw archives live under <crawl>/<kind>/ and processed output under <crawl>/<format>/<kind>/.
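The layout described above can be sketched in a few lines of Python. This is only an illustration of the two path patterns, not ccrawl's own code; the helper names and the example crawl, kind, and format values are hypothetical placeholders.

```python
from pathlib import Path


def raw_archive_dir(library_root, crawl, kind):
    """Directory for raw archives: <crawl>/<kind>/ under the library root."""
    return Path(library_root).expanduser() / crawl / kind


def processed_dir(library_root, crawl, fmt, kind):
    """Directory for processed output: <crawl>/<format>/<kind>/ under the library root."""
    return Path(library_root).expanduser() / crawl / fmt / kind


if __name__ == "__main__":
    root = "~/notes/ccrawl"  # documented default library root
    # "CRAWL-ID", "KIND", and "FORMAT" are placeholders, not real ccrawl values.
    print(raw_archive_dir(root, "CRAWL-ID", "KIND"))
    print(processed_dir(root, "CRAWL-ID", "FORMAT", "KIND"))
```

Note that the format segment sits between the crawl and the kind, so processed output for one crawl groups by format first.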

Environment variables

Variable          Used for
CCRAWL_DATA_DIR   Root data directory (overrides the default ~/data/ccrawl)
CCRAWL_LIBRARY    Dataset library root (overrides the default ~/notes/ccrawl)
CCRAWL_CACHE_DIR  Cache directory (overrides the default under the data dir)
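As a rough sketch of how these variables could combine with the per-command flags, the Python below resolves the same paths that ccrawl config show reports. The precedence used here (flag first, then environment variable, then built-in default) is an assumption, since the page does not spell it out, and resolve_paths is a hypothetical helper rather than part of ccrawl.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/data/ccrawl"
DEFAULT_LIBRARY_DIR = "~/notes/ccrawl"


def resolve_paths(env, data_dir_flag=None, library_dir_flag=None):
    """Return the resolved directory layout as a dict of Path objects.

    Assumed precedence (not confirmed by the docs): flag, then environment
    variable, then built-in default.
    """
    data_dir = Path(
        data_dir_flag or env.get("CCRAWL_DATA_DIR") or DEFAULT_DATA_DIR
    ).expanduser()
    # CCRAWL_CACHE_DIR moves only the cache; otherwise it sits under the data dir.
    cache_dir = Path(env.get("CCRAWL_CACHE_DIR") or data_dir / "cache").expanduser()
    library_dir = Path(
        library_dir_flag or env.get("CCRAWL_LIBRARY") or DEFAULT_LIBRARY_DIR
    ).expanduser()
    return {
        "data_dir": data_dir,
        "cache_dir": cache_dir,
        "raw_dir": data_dir / "raw",
        "parquet_dir": data_dir / "parquet",
        "db_path": data_dir / "ccrawl.duckdb",
        "library_dir": library_dir,
    }


if __name__ == "__main__":
    for name, path in resolve_paths(os.environ).items():
        print(f"{name:<12} {path}")
```

Under these assumptions, setting only CCRAWL_DATA_DIR moves the cache, raw, Parquet, and DuckDB paths together, while CCRAWL_CACHE_DIR relocates the cache alone.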

Global flags

Flag           Default         Meaning
-c, --crawl    latest          Crawl ID, a year, or latest/all
-o, --output   auto            table, json, jsonl, csv, tsv, url, raw
-n, --limit    0               Maximum results; 0 is unlimited
-j, --workers  per command     Concurrency for downloads and scans
--source       https           Bulk data source: https or s3
--rate         200ms           Minimum delay between requests, to stay polite
--retries      5               Retry attempts on 429 and 5xx
--timeout      2m              Per-request timeout
--no-cache     off             Bypass the on-disk cache for this run
--data-dir     ~/data/ccrawl   Root data directory
--library      off             Read and write under the dataset library
--library-dir  ~/notes/ccrawl  Dataset library root
--fields       all             Comma-separated columns to show
--template     none            Go text/template applied per row
--no-header    off             Omit the header row in table/csv output
--color        auto            auto, always, or never
-q, --quiet    off             Suppress progress output
-v, --verbose  off             Increase verbosity (repeatable)
--dry-run      off             Print actions without performing them

Output auto-detection

The default output format adapts to where it is going: an aligned table when the output is a terminal, JSONL when it is piped. That keeps interactive use readable and scripted use parseable without you setting -o either time. See output formats for the full set.
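The usual way to implement this kind of detection is to ask whether standard output is a terminal. The Python sketch below shows the idea; pick_output_format is a hypothetical name, and ccrawl's actual logic may consider more than this single check.

```python
import sys


def pick_output_format(requested="auto", stream=None):
    """Choose an output format the way the auto default is described.

    An explicit format always wins; "auto" becomes "table" on a terminal
    and "jsonl" when the stream is a pipe or file.
    """
    stream = sys.stdout if stream is None else stream
    if requested != "auto":
        return requested
    return "table" if stream.isatty() else "jsonl"


if __name__ == "__main__":
    # Run directly to see "table"; pipe through `cat` to see "jsonl".
    print(pick_output_format())
```

Because the check happens per invocation, the same command line behaves sensibly both interactively and inside a pipeline.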

Caching and politeness

ccrawl caches small index responses and manifests on disk so repeated commands do not re-fetch them. --rate enforces a minimum gap between requests, so even a busy session stays a good citizen when reading the public data. The cache info, cache dir, and cache clear commands manage the cache.
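To make the interaction between --rate and --retries concrete, here is a minimal Python sketch of a minimum-gap limiter combined with a retry loop on 429 and 5xx. It is an illustration under stated assumptions, not ccrawl's implementation: the class and function names are hypothetical, --retries is read as retries after the first attempt, and no backoff is applied beyond the minimum gap.

```python
import time


class MinGapLimiter:
    """Enforce a minimum gap between the starts of consecutive requests."""

    def __init__(self, min_gap_s=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s  # 0.2 s mirrors the documented 200ms default
        self._clock = clock
        self._sleep = sleep
        self._last = None  # start time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def should_retry(status):
    """Retry on 429 and on any 5xx status, as the --retries flag describes."""
    return status == 429 or 500 <= status <= 599


def fetch_with_retries(do_request, limiter, retries=5):
    """Call do_request() until the status is not retryable or retries run out.

    do_request is a hypothetical callable that returns an HTTP status code.
    Assumption: one initial attempt plus up to `retries` further attempts.
    """
    status = None
    for _attempt in range(retries + 1):
        limiter.wait()  # every attempt, including retries, respects the gap
        status = do_request()
        if not should_retry(status):
            break
    return status
```

Routing retries through the same limiter means a failing endpoint is never hit faster than the configured rate.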