
I don't know how people can use this data. There's so much of it! I don't see any hard drives that are 80 TB, so it seems like people would need some kind of RAID setup that can handle 200+ TB of uncompressed data.


You don't need to download the whole thing. You can stream the WARC files from S3 and extract only the information you want (like pages with content). It's a lot smaller when you only keep the links and text.
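For illustration, here's a minimal sketch of that kind of selective extraction, assuming the warcio and requests packages are installed. The WARC URL is just a placeholder; real paths come from the crawl's path listing.

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path -- substitute a real one from the crawl's warc.paths listing.
    WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-.../example.warc.gz"

    def extract_pages(url):
        """Stream a gzipped WARC file and yield (target_uri, raw_body) pairs."""
        resp = requests.get(url, stream=True)
        resp.raise_for_status()
        for record in ArchiveIterator(resp.raw):
            # 'response' records hold the fetched HTTP response (headers + body).
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                yield uri, body

    if __name__ == "__main__":
        for uri, body in extract_pages(WARC_URL):
            # Keep only what you care about (links, plain text, etc.).
            print(uri, len(body))
            break  # stop after the first record for demonstration

Since the records are processed as a stream, you never hold more than one page in memory, so the full 80 TB never touches your disk.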


A search index is often made of smaller independent pieces called segments. So you can download and process the data progressively on a local machine, upload the resulting segments to object storage, and run queries against them (a rough sketch of the workflow is below). That's what we did for this project: https://quickwit.io/blog/commoncrawl

Also an interesting blog post here: https://fulmicoton.com/posts/commoncrawl/
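To make the shape of that workflow concrete, here's a rough sketch (not the actual Quickwit pipeline; the bucket, batches, and extraction function are placeholders I made up), assuming boto3 is installed:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-index-bucket"                  # placeholder destination bucket
    BATCHES = ["warc-batch-0", "warc-batch-1"]  # placeholder batches of WARC paths

    def extracted_docs(batch):
        """Placeholder extraction step: yield {uri, text} dicts for one batch."""
        yield {"uri": "https://example.com/", "text": "hello"}

    for i, batch in enumerate(BATCHES):
        # 1. Build one small, self-contained segment locally.
        segment_path = f"segment-{i:05d}.jsonl"
        with open(segment_path, "w") as out:
            for doc in extracted_docs(batch):
                out.write(json.dumps(doc) + "\n")
        # 2. Upload it to object storage. Each segment is independently
        #    searchable, so queries can already run against whatever
        #    has been processed so far.
        s3.upload_file(segment_path, BUCKET, f"segments/{segment_path}")

The point is that you only ever need enough local disk for one segment at a time; the full index lives in object storage.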


Huh, that's pretty expensive. For comparison, my search engine has an operational cost of ~$50/mo.



