I guess you're getting the size from the Arctic Code Vault? https://github.blog/news-insights/company-news/github-archiv... That snapshot was taken five years ago and is presumably stored in git's compressed on-disk format. Caching the corresponding GitHub HTML would take significantly more space.
> every *active* public GitHub repository. [active meaning any] repo with any commits between [2019-11-13 and 2020-02-02 ...] The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size
So no binaries larger than 100KB, no commit history, no issues or PR data, and no other git metadata.
Admittedly, that 18.6PB figure includes private repositories too, and there's no published number for public repositories alone, but I'm certain the public portion is at least a noticeable fraction of that ~19PB.
> At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise.
About $250,000 for 1,000 HDDs and you could store all of that data. That means a private individual, say a well-paid FAANG engineer, could afford a copy of the whole dataset after 2-3 years on the job. For companies working on AI, that raw storage cost is nothing at all.
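Back-of-envelope, assuming ~20TB drives at roughly $250 each (both numbers are rough assumptions, and this ignores chassis, power, redundancy, and the bandwidth needed to actually pull the data):

  # Rough cost of privately mirroring GitHub's 18.6PB of git data.
  # Illustrative figures only: ~20TB HDDs at ~$250 apiece.
  total_pb = 18.6
  drive_tb = 20
  drive_price_usd = 250

  drives_needed = (total_pb * 1000) / drive_tb   # PB -> TB
  cost_usd = drives_needed * drive_price_usd
  print(f"{drives_needed:.0f} drives, ~${cost_usd:,.0f}")
  # -> 930 drives, ~$232,500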
All of public GitHub is only 21TB. Can't they just host that on a dumb cache and let the bots crawl to their heart's content?
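For what it's worth, a "dumb cache" really is conceptually simple: fetch each page once, store it on disk, serve repeats from disk. A toy sketch in Python below (CACHE_DIR and ORIGIN are placeholders, and a real deployment would use an off-the-shelf caching proxy with eviction, header handling, error handling, etc.):

  # Toy "dumb cache": an HTTP proxy that stores every response on disk
  # and serves repeat requests without hitting the origin again.
  # Illustrative only: single-threaded, GET-only, no eviction, no revalidation.
  import hashlib
  import http.server
  import pathlib
  import urllib.request

  CACHE_DIR = pathlib.Path("/var/tmp/dumb-cache")  # placeholder location
  ORIGIN = "https://example.org"                   # placeholder upstream

  class CachingHandler(http.server.BaseHTTPRequestHandler):
      def do_GET(self):
          key = hashlib.sha256(self.path.encode()).hexdigest()
          cached = CACHE_DIR / key
          if cached.exists():
              body = cached.read_bytes()           # cache hit: serve from disk
          else:
              with urllib.request.urlopen(ORIGIN + self.path) as resp:
                  body = resp.read()               # cache miss: fetch once
              CACHE_DIR.mkdir(parents=True, exist_ok=True)
              cached.write_bytes(body)
          self.send_response(200)
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      http.server.HTTPServer(("", 8080), CachingHandler).serve_forever()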