Fetching content
Pull the exact bytes Common Crawl captured for a URL, as text, Markdown, links, or the raw HTTP response.
Once you know a capture exists, you want what is in it. ccrawl has two commands
for that: get for the common case of a URL, and fetch for when you already
hold a record location.
get: the curl for Common Crawl
ccrawl get <url> looks the URL up in the index, fetches the single WARC
record with a byte-range request, and extracts what you ask for. One command,
one round trip to the data:
ccrawl get example.com --text # readable plain text
ccrawl get example.com --markdown # the page as Markdown
ccrawl get example.com --links # outbound links
ccrawl get example.com --headers # the captured HTTP response headers
With no extraction flag you get the raw HTTP response body, exactly as Common
Crawl stored it. Pick a crawl with -c just like search:
ccrawl get example.com --text -c 2024-51
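To make the lookup step concrete, here is a minimal sketch of an index query against Common Crawl's public CDX index server. It is not ccrawl's own code: the helper names collection_id and index_query_url are hypothetical, and it is an assumption that a short crawl ID such as 2024-51 maps to the collection CC-MAIN-2024-51 and that ccrawl consults this particular index at all.

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"  # public CDX index server


def collection_id(crawl: str) -> str:
    """Expand a short crawl ID like '2024-51' to 'CC-MAIN-2024-51' (assumed mapping)."""
    return crawl if crawl.startswith("CC-MAIN-") else f"CC-MAIN-{crawl}"


def index_query_url(url: str, crawl: str) -> str:
    """Build a CDX query asking for one capture of `url`, returned as JSON lines."""
    params = urlencode({"url": url, "output": "json", "limit": 1})
    return f"{INDEX_HOST}/{collection_id(crawl)}-index?{params}"


print(index_query_url("example.com", "2024-51"))
```

Each line of the response is a JSON object that includes the filename, offset, and length of the capture, which is all a byte-range request needs.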
Because each WARC record is compressed as its own gzip member, get downloads
only that record, not the file around it. That is what makes it feel like
fetching a live page rather than mining an archive.
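The per-record gzip property can be checked offline. The sketch below builds a stand-in archive from two independently compressed records and decompresses only the second one by offset and length. The names read_member and range_header are hypothetical helpers for illustration, not part of ccrawl, and the archive here is a local byte string rather than a real WARC file.

```python
import gzip


def read_member(archive: bytes, offset: int, length: int) -> bytes:
    """Decompress the single gzip member at archive[offset:offset + length]."""
    return gzip.decompress(archive[offset:offset + length])


def range_header(offset: int, length: int) -> dict[str, str]:
    """HTTP Range header for the same slice; the end byte is inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Stand-in for a WARC file: two separately compressed records, concatenated
first = gzip.compress(b"WARC/1.0 record one\n")
second = gzip.compress(b"WARC/1.0 record two\n")
archive = first + second

# Only the second member's bytes are read and decompressed
print(read_member(archive, len(first), len(second)))
print(range_header(len(first), len(second)))
```

Against the real archive, the same offset and length would go into the Range header of a single HTTP request, so the bytes on the wire are just the one compressed record.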
fetch: when you have a location
ccrawl fetch reads records by explicit location. Point it at a record with
flags, or stream a list of locations on stdin with -. It is the receiving end
of the pipelines that begin with search or table locations:
# fetch records named on stdin (JSONL with filename/offset/length)
ccrawl search 'example.com/*' -o jsonl | ccrawl fetch -
# the same, written one file per record
ccrawl search 'example.com/*' -o jsonl | ccrawl fetch - --dir --out-dir out/
# a single record by exact location
ccrawl fetch --file crawl-data/.../CC-MAIN-...warc.gz --offset 698683535 --length 1262
fetch takes the same content flags as get, so you can transform on the way
through:
ccrawl search 'example.com/*' -o jsonl | ccrawl fetch - --markdown
ccrawl search 'example.com/*' -o jsonl | ccrawl fetch - --links -o url
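For a sense of what travels through that pipe, here is a minimal sketch of a reader for the JSONL location stream. Only the filename, offset, and length fields come from the description above. The Location class, the read_locations helper, and the sample records are hypothetical, and it is an assumption that offsets may arrive either as JSON numbers or as numeric strings.

```python
import io
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Location:
    filename: str
    offset: int
    length: int


def read_locations(lines: Iterable[str]) -> Iterator[Location]:
    """Yield one Location per non-empty JSONL line; extra fields are ignored."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # int() accepts both JSON numbers and numeric strings
        yield Location(rec["filename"], int(rec["offset"]), int(rec["length"]))


# Hypothetical sample input; a real pipeline would pass sys.stdin instead
sample = io.StringIO(
    '{"url": "https://example.com/", "filename": "a.warc.gz", "offset": "100", "length": "50"}\n'
    "\n"
    '{"url": "https://example.com/about", "filename": "a.warc.gz", "offset": 150, "length": 75}\n'
)
for loc in read_locations(sample):
    print(loc)
```

Reading lazily, one line at a time, means the consumer can start fetching before the search on the left side of the pipe has finished.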
extract: content from a page
When you only want one piece of a captured page, ccrawl extract is a thin
shortcut over get:
ccrawl extract title example.com # just the <title>
ccrawl extract text example.com # readable text
ccrawl extract markdown example.com # Markdown
ccrawl extract links example.com # outbound links
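As a rough picture of what title and link extraction involve, the sketch below pulls both out of an HTML string with Python's standard html.parser. It is an illustration under assumptions, not ccrawl's implementation: PageSummary and summarize are hypothetical names, and this version collects every a href it finds rather than filtering for outbound links only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageSummary(HTMLParser):
    """Collect the <title> text and absolute href targets from an HTML page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the captured page's URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html: str, base_url: str) -> tuple[str, list[str]]:
    """Return (title, links) for one page."""
    parser = PageSummary(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

Calling summarize(body, "https://example.com/") on a captured response body returns the stripped title and a list of absolute URLs, with relative hrefs resolved against the page URL.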
Picking the right tool
- You have a URL and want its content: ccrawl get.
- You have a list of locations (from search or table locations): ccrawl fetch -.
- You want a whole archive file, not single records: see bulk and archives.