I guess you're getting the size from the Arctic Code Vault? https://github.blog/news-insights/company-news/github-archiv... That snapshot was taken five years ago and is presumably stored in git's compressed on-disk format. Caching the corresponding GitHub HTML would take significantly more space.
> every *active* public GitHub repository. [active meaning any] repo with any commits between [2019-11-13 and 2020-02-02 ...] The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size
So no binaries larger than 100KB, no commit history, no issues or PR data, and no other git metadata.
Admittedly, that 18.6PB figure includes private repositories too, and there's no published number for public repositories alone, but I'm certain the public portion is at least a noticeable fraction of that ~19PB.
> At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise.
About $250,000 for 1,000 HDDs and you could store all of that data. That means a private individual, say a well-paid FAANG engineer, could afford a copy of the whole dataset after 2-3 years on the job. For companies working on AI, that raw storage cost is nothing at all.
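Back-of-envelope, assuming ~20TB drives at roughly $250 each (both numbers are rough assumptions, and this ignores chassis, power, redundancy, and the bandwidth needed to actually pull the data):

  # Rough cost of privately mirroring GitHub's 18.6PB of git data.
  # Illustrative figures only: ~20TB HDDs at ~$250 apiece.
  total_pb = 18.6
  drive_tb = 20
  drive_price_usd = 250

  drives_needed = (total_pb * 1000) / drive_tb   # PB -> TB
  cost_usd = drives_needed * drive_price_usd
  print(f"{drives_needed:.0f} drives, ~${cost_usd:,.0f}")
  # -> 930 drives, ~$232,500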
All of public GitHub is only 21TB. Can't they just host that on a dumb cache and let the bots crawl to their heart's content?
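For what it's worth, a "dumb cache" really is conceptually simple: fetch each page once, store it on disk, serve repeats from disk. A toy sketch in Python below (CACHE_DIR and ORIGIN are placeholders, and a real deployment would use an off-the-shelf caching proxy with eviction, header handling, error handling, etc.):

  # Toy "dumb cache": an HTTP proxy that stores every response on disk
  # and serves repeat requests without hitting the origin again.
  # Illustrative only: single-threaded, GET-only, no eviction, no revalidation.
  import hashlib
  import http.server
  import pathlib
  import urllib.request

  CACHE_DIR = pathlib.Path("/var/tmp/dumb-cache")  # placeholder location
  ORIGIN = "https://example.org"                   # placeholder upstream

  class CachingHandler(http.server.BaseHTTPRequestHandler):
      def do_GET(self):
          key = hashlib.sha256(self.path.encode()).hexdigest()
          cached = CACHE_DIR / key
          if cached.exists():
              body = cached.read_bytes()           # cache hit: serve from disk
          else:
              with urllib.request.urlopen(ORIGIN + self.path) as resp:
                  body = resp.read()               # cache miss: fetch once
              CACHE_DIR.mkdir(parents=True, exist_ok=True)
              cached.write_bytes(body)
          self.send_response(200)
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      http.server.HTTPServer(("", 8080), CachingHandler).serve_forever()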