Configuration

The data directory, environment variables, and global flags, with their defaults.

ccrawl needs almost no configuration. There is no config file; every option is a flag or an environment variable, and the defaults are chosen so the common case needs neither.

The data directory

ccrawl keeps all of its state under one tree, ~/data/ccrawl by default: the on-disk cache, downloaded archives, converted Parquet, and the local DuckDB file. To see the resolved paths at any time, run:

ccrawl config show

data_dir     ~/data/ccrawl
cache_dir    ~/data/ccrawl/cache
raw_dir      ~/data/ccrawl/raw
parquet_dir  ~/data/ccrawl/parquet
db_path      ~/data/ccrawl/ccrawl.duckdb

Point the whole tree somewhere else with CCRAWL_DATA_DIR, or per-command with --data-dir.
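
To make the override rules concrete, here is a minimal Python sketch of how the paths above could be derived. It is not ccrawl's actual code. The function name resolve_paths is hypothetical, and the precedence (flag first, then environment variable, then default) is an assumption that this page does not state explicitly.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/data/ccrawl"


def resolve_paths(data_dir_flag=None, env=None):
    """Return resolved paths keyed like the output of `ccrawl config show`.

    Hypothetical helper: assumes --data-dir beats CCRAWL_DATA_DIR,
    which in turn beats the built-in default.
    """
    env = os.environ if env is None else env
    root = Path(
        data_dir_flag or env.get("CCRAWL_DATA_DIR") or DEFAULT_DATA_DIR
    ).expanduser()

    # CCRAWL_CACHE_DIR moves only the cache; everything else follows the root.
    cache_override = env.get("CCRAWL_CACHE_DIR")
    cache = Path(cache_override).expanduser() if cache_override else root / "cache"

    return {
        "data_dir": root,
        "cache_dir": cache,
        "raw_dir": root / "raw",
        "parquet_dir": root / "parquet",
        "db_path": root / "ccrawl.duckdb",
    }


if __name__ == "__main__":
    # Print the resolved paths in the same two-column layout.
    for name, path in resolve_paths().items():
        print(f"{name:<12} {path}")
```

Passing the environment in as a plain dict keeps the sketch easy to try out without touching the real process environment.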

The dataset library

The --library flag (see bulk and archives) reads and writes a curated corpus of archive files in a tree of its own, kept separate from the data directory so that scratch state and the files you keep never mix. It defaults to ~/notes/ccrawl and is reported as library_dir in ccrawl config show:

library_dir  ~/notes/ccrawl

Move it with CCRAWL_LIBRARY or per-command with --library-dir. Inside it, raw archives live under <crawl>/<kind>/ and processed output under <crawl>/<format>/<kind>/.
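
The layout described above can be sketched as two small path builders. This is an illustration only: the function names are hypothetical, and the crawl ID, kind, and format values in the usage lines are placeholders chosen for the example, not values taken from this page.

```python
from pathlib import Path


def raw_archive_dir(library_dir, crawl, kind):
    """Raw archives: <library>/<crawl>/<kind>/ (hypothetical helper)."""
    return Path(library_dir).expanduser() / crawl / kind


def processed_dir(library_dir, crawl, fmt, kind):
    """Processed output: <library>/<crawl>/<format>/<kind>/ (hypothetical helper)."""
    return Path(library_dir).expanduser() / crawl / fmt / kind


if __name__ == "__main__":
    # Placeholder values, assumed purely for illustration.
    print(raw_archive_dir("~/notes/ccrawl", "CC-MAIN-2024-10", "warc"))
    print(processed_dir("~/notes/ccrawl", "CC-MAIN-2024-10", "parquet", "warc"))
```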

Environment variables

Variable	Used for
`CCRAWL_DATA_DIR`	Root data directory (overrides the default `~/data/ccrawl`)
`CCRAWL_LIBRARY`	Dataset library root (overrides the default `~/notes/ccrawl`)
`CCRAWL_CACHE_DIR`	Cache directory (overrides the default under the data dir)

Global flags

Flag	Default	Meaning
`-c, --crawl`	`latest`	Crawl ID, a year, or `latest`/`all`
`-o, --output`	auto	`table`, `json`, `jsonl`, `csv`, `tsv`, `url`, `raw`
`-n, --limit`	`0`	Maximum results; `0` is unlimited
`-j, --workers`	per command	Concurrency for downloads and scans
`--source`	`https`	Bulk data source: `https` or `s3`
`--rate`	`200ms`	Minimum delay between requests, to stay polite
`--retries`	`5`	Retry attempts on HTTP 429 and 5xx responses
`--timeout`	`2m`	Per-request timeout
`--no-cache`	off	Bypass the on-disk cache for this run
`--data-dir`	`~/data/ccrawl`	Root data directory
`--library`	off	Read and write under the dataset library
`--library-dir`	`~/notes/ccrawl`	Dataset library root
`--fields`	all	Comma-separated columns to show
`--template`	none	Go text/template applied per row
`--no-header`	off	Omit the header row in table/csv output
`--color`	auto	`auto`, `always`, or `never`
`-q, --quiet`	off	Suppress progress output
`-v, --verbose`	off	Increase verbosity (repeatable)
`--dry-run`	off	Print actions without performing them

Output auto-detection

The default output format adapts to its destination: an aligned table when the output is a terminal, JSONL when it is piped. That keeps interactive use readable and scripted use parseable without your having to set -o in either case. See output formats for the full set.
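
The detection rule is the familiar "is standard output a terminal?" check. A minimal Python sketch of that technique follows. It is not ccrawl's implementation, and both function names are hypothetical.

```python
import sys


def default_output_format(stream=None):
    """Pick 'table' for a terminal and 'jsonl' for a pipe or file."""
    stream = sys.stdout if stream is None else stream
    return "table" if stream.isatty() else "jsonl"


def choose_output_format(output_flag=None, stream=None):
    """An explicit -o value wins; otherwise fall back to detection."""
    return output_flag or default_output_format(stream)


if __name__ == "__main__":
    # Run directly, then again piped through `cat`, to see both answers.
    print(choose_output_format())
```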

Caching and politeness

ccrawl caches small index responses and manifests on disk so that repeated commands do not re-fetch them. --rate enforces a minimum gap between requests, so even a busy session stays a good citizen toward the public data source. The cache info, cache dir, and cache clear commands manage the cache.
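
As a rough illustration of what a minimum gap between requests and retrying on 429 and 5xx mean in practice, here is a small Python sketch. It is a generic version of the technique under stated assumptions, not ccrawl's code. The class and function names are hypothetical, the 0.2 second default mirrors the documented --rate of 200ms, and the retry count mirrors the documented --retries of 5.

```python
import time


class MinGapLimiter:
    """Keep at least `min_gap_seconds` between consecutive requests."""

    def __init__(self, min_gap_seconds=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until the minimum gap since the last request has passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def should_retry(status, retries_done, max_retries=5):
    """Retry only on HTTP 429 or 5xx, and only while retries remain."""
    retryable = status == 429 or 500 <= status <= 599
    return retryable and retries_done < max_retries
```

A caller would invoke wait() before each request and consult should_retry() after each response. Backoff between retries is left out of the sketch because this page does not describe it.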