
Most of these unauthenticated requests are read-only.

All of public GitHub is only 21 TB. Can't they just host that on a dumb cache and let the bots crawl to their heart's content?




I guess you're getting the size from the Arctic Code Vault? https://github.blog/news-insights/company-news/github-archiv... That was 5 years ago and is presumably in git's compressed storage format. Caching the corresponding GitHub HTML would take significantly more.


You're talking about the 21TB captured to the arctic code vault, but that 21TB isn't "all of public github"

Quoting from https://archiveprogram.github.com/arctic-vault/

> every *active* public GitHub repository. [active meaning any] repo with any commits between [2019-11-13 and 2020-02-02 ...] The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size

So no files larger than 100KB, no commit history, no issues or PR data, no other git metadata.

If we look at this blog post from 2022, the number we get is 18.6 PB for git data alone: https://github.blog/engineering/architecture-optimization/sc...

Admittedly, that includes private repositories too, and there's no public number for just public repositories, but I'm certain it's at least a noticeable fraction of that ~19 PB.
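
To put those two numbers side by side, here's a quick back-of-envelope sketch in Python, assuming decimal units and taking the 21 TB and 18.6 PB figures from the links above:

    vault_tb = 21                 # Arctic Code Vault snapshot
    total_git_tb = 18_600         # 18.6 PB of git data, expressed in TB
    print(f"{vault_tb / total_git_tb:.2%}")   # ~0.11%

So the vault snapshot is roughly a thousandth of the git data GitHub reported in 2022; even if public repositories made up only a modest fraction of that total, they'd still dwarf 21 TB.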


> At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise.

About $250,000 for ~1,000 HDDs and you have all the data. That means a private individual, say a top FAANG engineer, could afford a copy of the whole dataset after 2-3 years on the job. For companies dealing in AI, that raw price is nothing at all.
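
For what it's worth, a rough sanity check of that price (a sketch only; the 20 TB drive size and ~$250 per drive are assumptions, and this ignores redundancy, chassis, and power):

    import math

    total_tb = 18_600                          # 18.6 PB of git data, in TB
    drive_tb = 20                              # assumed capacity of one commodity HDD
    drives = math.ceil(total_tb / drive_tb)    # 930 drives
    cost = drives * 250                        # $232,500
    print(drives, cost)

which lands in the same ballpark as the $250k figure above.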



