
I don't know how people can use this data. There's so much of it! I don't see any hard drives that are 80 TB, so it seems like people would need some kind of RAID setup that can handle 200+ TB of uncompressed data.


You don't need to download the whole thing. You can stream the WARC files from S3 and extract only the information you want (like pages with content). It's a lot smaller when you only keep the links and text.
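For illustration, here's a minimal sketch of that kind of selective extraction, assuming the warcio and requests packages are installed. The WARC URL is just a placeholder; real paths come from the crawl's path listing.

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path -- substitute a real one from the crawl's warc.paths listing.
    WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-.../example.warc.gz"

    def extract_pages(url):
        """Stream a gzipped WARC file and yield (target_uri, raw_body) pairs."""
        resp = requests.get(url, stream=True)
        resp.raise_for_status()
        for record in ArchiveIterator(resp.raw):
            # 'response' records hold the fetched HTTP response (headers + body).
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                yield uri, body

    if __name__ == "__main__":
        for uri, body in extract_pages(WARC_URL):
            # Keep only what you care about (links, plain text, etc.).
            print(uri, len(body))
            break  # stop after the first record for demonstration

Since the records are processed as a stream, you never hold more than one page in memory, so the full 80 TB never touches your disk.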


A search index is often made of smaller independent pieces called segments. So you can download and process the data progressively on a local machine, upload the resulting segments to object storage, and run queries against them (a rough sketch of the workflow is below). That's what we did for this project: https://quickwit.io/blog/commoncrawl

Also an interesting blog post here: https://fulmicoton.com/posts/commoncrawl/
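To make the shape of that workflow concrete, here's a rough sketch (not the actual Quickwit pipeline; the bucket, batches, and extraction function are placeholders I made up), assuming boto3 is installed:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-index-bucket"                  # placeholder destination bucket
    BATCHES = ["warc-batch-0", "warc-batch-1"]  # placeholder batches of WARC paths

    def extracted_docs(batch):
        """Placeholder extraction step: yield {uri, text} dicts for one batch."""
        yield {"uri": "https://example.com/", "text": "hello"}

    for i, batch in enumerate(BATCHES):
        # 1. Build one small, self-contained segment locally.
        segment_path = f"segment-{i:05d}.jsonl"
        with open(segment_path, "w") as out:
            for doc in extracted_docs(batch):
                out.write(json.dumps(doc) + "\n")
        # 2. Upload it to object storage. Each segment is independently
        #    searchable, so queries can already run against whatever
        #    has been processed so far.
        s3.upload_file(segment_path, BUCKET, f"segments/{segment_path}")

The point is that you only ever need enough local disk for one segment at a time; the full index lives in object storage.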


Huh, that's pretty expensive. For comparison, my search engine has an operational cost of ~$50/mo.



