If I want to fine-tune this on some text data, are there obvious constraints to be aware of? I've got a reasonable amount of text (~50-100 GB), but seeing that a json file gets created makes me think that's probably too much. gpt-2-simple seems to describe 100 MB as 'massive', so what's a reasonable amount to aim for?
Or should I be training from scratch? (edit: having looked into training from scratch, I'm guessing that's a 'no' since I don't have thousands to throw at this)
Oh, that's less than I was expecting - I'm used to having significantly less data to play with than the major entities. I guess I still do, but in this case a pretty reasonable amount of data was enough for very impressive results.
> I'm not 100% sure you can encode and store that much data in memory with the current implementation, even with the fast tokenizers.
That makes sense. I wasn't too sure what sensible sizes would be, but there are probably some interesting subsets of the data I could take and use for fine-tuning (or some sampled data) - maybe down to 100 MB, as that sounded like a large-but-ok amount to use.
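In case it's useful to anyone else hitting the same size limits, here's a rough sketch of how I might sample a large plain-text corpus down to ~100 MB before fine-tuning. The paths, sampling rate, and the assumption of newline-delimited `.txt` files are all just placeholders for my own setup, not anything this library requires:

```python
# Minimal sketch: randomly sample lines from a large plain-text corpus
# down to roughly 100 MB for fine-tuning. Paths and the sampling rate
# are hypothetical and assume a ~50-100 GB corpus of newline-delimited text.
import random
from pathlib import Path

TARGET_BYTES = 100 * 1024 * 1024   # stop once ~100 MB has been written
SAMPLE_RATE = 0.002                # keep ~0.2% of lines; tune to corpus size

random.seed(0)
written = 0
with open("sampled_corpus.txt", "w", encoding="utf-8") as out:
    for path in Path("corpus/").glob("*.txt"):   # hypothetical input directory
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                if random.random() < SAMPLE_RATE:
                    out.write(line)
                    written += len(line.encode("utf-8"))
                if written >= TARGET_BYTES:
                    break
        if written >= TARGET_BYTES:
            break
```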
I'm looking forward to seeing what I can get out of this, thanks for making something simple enough that I can do that for a "I wonder if" kind of problem!