It is. It's LibGen + Common Crawl + a Wikipedia dump + a bunch of other datasets. OpenAI claims that Common Crawl is roughly 60% of the total training corpus, and it also claims to use the other datasets listed. They probably also have some sort of proprietary Q&A/search query corpus via Microsoft.