It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets. OpenAI ... | Hacker News


dirheist on Feb 1, 2023 | on: ChatGPT Plus

It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets. OpenAI claim that commoncrawl is roughly 60% of its total training corpus and they also claim they use the other datasets listed. They probably also have some sort of proprietary Q&A/search query corpus via Microsoft.

humanistbot on Feb 1, 2023

> It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets.

I'm having trouble finding a source for the libgen claim. Is that confirmed or just rumor?

mandmandam on Feb 1, 2023

The ChatGPT Prompt book by LifeArchitect.ai is where I saw it: https://docs.google.com/presentation/d/17b_ocq-GL5lhV_bYSShz...

dblitt on Feb 2, 2023

> Informed 'best guess' only.

> Sources: https://lifearchitect.ai/papers/

Doesn't seem too convincing to me
