If I want to fine-tune this on some text data, are there obvious constraints to be aware of? I've got a reasonable amount of text (~50-100 GB), but seeing that a json file gets created makes me think that's probably too much. gpt-2-simple seems to describe 100 MB as 'massive', so what's a reasonable amount to aim for?
Or should I be training from scratch? (edit: having looked into training from scratch, I'm guessing that's a 'no' since I don't have thousands to throw at this)
Oh, that's less than I was expecting - I'm used to having significantly less data to play with than the major entities. I guess I still do, but in this case a pretty reasonable amount of data was enough for very impressive results.
> I'm not 100% sure you can encode and store that much data in memory with the current implementation, even with the fast tokenizers.
That makes sense. I wasn't too sure what sensible sizes would be, but there are probably some interesting subsets of the data I could take and use for fine-tuning (or some sampled data) - maybe down to 100 MB, as that sounded like a large-but-ok amount to use.
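In case it's useful to anyone else hitting the same size limits, here's a rough sketch of how I might sample a large plain-text corpus down to ~100 MB before fine-tuning. The paths, sampling rate, and the assumption of newline-delimited `.txt` files are all just placeholders for my own setup, not anything this library requires:

```python
# Minimal sketch: randomly sample lines from a large plain-text corpus
# down to roughly 100 MB for fine-tuning. Paths and the sampling rate
# are hypothetical and assume a ~50-100 GB corpus of newline-delimited text.
import random
from pathlib import Path

TARGET_BYTES = 100 * 1024 * 1024   # stop once ~100 MB has been written
SAMPLE_RATE = 0.002                # keep ~0.2% of lines; tune to corpus size

random.seed(0)
written = 0
with open("sampled_corpus.txt", "w", encoding="utf-8") as out:
    for path in Path("corpus/").glob("*.txt"):   # hypothetical input directory
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                if random.random() < SAMPLE_RATE:
                    out.write(line)
                    written += len(line.encode("utf-8"))
                if written >= TARGET_BYTES:
                    break
        if written >= TARGET_BYTES:
            break
```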
I'm looking forward to seeing what I can get out of this, thanks for making something simple enough that I can do that for a "I wonder if" kind of problem!