Interestingly, they seem to have different token IDs for "Word", "word", " Word" and " word". That seems like kind of a wasteful design.
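You can check this yourself with OpenAI's tiktoken library; here's a quick sketch using the cl100k_base encoding (picked as an example, other encodings behave similarly):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # example GPT-3.5/4-era encoding

    for variant in ["Word", "word", " Word", " word"]:
        print(repr(variant), "->", enc.encode(variant))
    # Each variant encodes to a different token id (or id sequence).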
It seems like it would make more sense to have a single token for all variants and then a "capitalized where not expected" token (e.g. "foo Foo"), a "not capitalized where expected" token (e.g. "foo. foo") and a "missing space where expected" token (e.g. "foo.Foo").
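Purely as illustration (the marker names and the regex heuristics here are made up, not anything a real tokenizer uses), a normalization pass along those lines might look like:

    import re

    # Hypothetical marker tokens for the proposed scheme.
    CAP_UNEXPECTED = "<cap>"      # capitalized where not expected, e.g. "foo Foo"
    NOCAP_EXPECTED = "<nocap>"    # lowercase where a capital is expected, e.g. "foo. foo"
    NO_SPACE       = "<nospace>"  # missing space after punctuation, e.g. "foo.Foo"

    def normalize(text: str) -> str:
        out = []
        expect_capital = False  # be lenient about the very first word
        for match in re.finditer(r"[A-Za-z]+|[.!?]", text):
            tok = match.group()
            if tok in ".!?":
                out.append(tok)
                expect_capital = True
                continue
            # Missing space: previous character is punctuation glued to this word.
            if match.start() > 0 and text[match.start() - 1] in ".!?":
                out.append(NO_SPACE)
            if tok[0].isupper() and not expect_capital:
                out.append(CAP_UNEXPECTED)
            elif tok[0].islower() and expect_capital:
                out.append(NOCAP_EXPECTED)
            out.append(tok.lower())
            expect_capital = False
        return " ".join(out)

    print(normalize("foo Foo"))   # foo <cap> foo
    print(normalize("foo. foo"))  # foo . <nocap> foo
    print(normalize("foo.Foo"))   # foo . <nospace> foo

The downstream vocabulary would then only need the lowercase variants plus a handful of marker tokens.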
The lack of any normalization also means that WrItInG tExT lIkE tHiS will prevent future GPT versions from making full use of the text during training, unless they change the tokenization (or the model is so overpowered that it doesn't matter).
The tokenization is a statistical product of the frequency of byte sequences in the training corpus. It might seem unintuitive, but I wouldn't go so far as to say it's "wasteful". It may very well be, but frankly you'd need a good explanation for why byte pair encoding is so much more successful than other tokenization schemes.
> why byte pair encoding is so much more successful than other tokenization schemes.
What's the evidence for that, please? Just asking because I don't know, not because I disagree. I've read a bunch of BPE explainers, but nobody has bothered to explain why or how we landed on BPE.
I'm not an AI expert, so I don't know what research has been done to verify it, but this comment below, https://news.ycombinator.com/item?id=35454839, helped me understand it, and intuitively I think it makes sense.
That is, byte pair encoding tokenization is itself based on how often particular characters appear next to each other in the training data. So if the training data very frequently sees certain characters together (as it does, of course, in common words), those words end up as a single token. That makes sense given how an LLM works, because it looks for statistical relationships among strings of tokens. The way I think of it is that byte pair encoding is essentially a pre-processing step that already optimizes for statistical relationships among individual characters.
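A toy version of the merge loop makes this concrete. Real tokenizers (GPT's included) work on raw bytes over a huge corpus and keep the learned merges as the vocabulary, but the core mechanism is just this (a sketch, not any production implementation):

    from collections import Counter

    def train_bpe(corpus: str, num_merges: int):
        # Start from individual characters; repeatedly merge the most
        # frequent adjacent pair into a new, longer symbol.
        symbols = list(corpus)
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(symbols, symbols[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            merges.append(a + b)
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return merges, symbols

    merges, symbols = train_bpe("the cat sat on the mat the cat sat", num_merges=8)
    print(merges)   # frequent fragments like "at" merge first and grow into longer chunks
    print(symbols)

Frequent words end up as single symbols simply because their character pairs keep winning the merge contest.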
The actual tokenizer often does not matter, since you can add preprocessors/normalizers. I assume they did it like this because capitalization matters in a lot of contexts.
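For example, the Hugging Face tokenizers library lets you chain normalizers in front of the model (whether you should throw that information away is exactly the debate here):

    # pip install tokenizers
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # A chain similar in spirit to what "uncased" BERT models did:
    # unicode-decompose, drop accents, fold case.
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    print(normalizer.normalize_str("Héllò hôw are ü?"))  # "hello how are u?"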
Similarly, pre-processing can be harmful. I think there are reasonable predictive differences when predicting the next-word follow-up to a sentence that's properly capitalized versus one that's all lowercase. Not only will the "all lowercase" convention likely prevail in forward predictions, it also indicates something about the context of the writing, the author, and their sense of style.
It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.
A good example is the initially released BERT-multilingual-uncased model from the first BERT paper, which (without even mentioning it anywhere) not only collapsed case but also removed diacritic marks from Latin characters, killing its performance on languages that rely heavily on them.
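To make the failure mode concrete, here's roughly what such an "uncased + accent-stripping" pipeline does (my own sketch with arbitrarily chosen example words, not the actual BERT preprocessing code):

    import unicodedata

    def uncased_stripped(text: str) -> str:
        # Lowercase, unicode-decompose, then drop combining marks (diacritics).
        decomposed = unicodedata.normalize("NFD", text.lower())
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

    # Distinct words collapse to the same string, so the model can no longer tell them apart.
    print(uncased_stripped("schon"), uncased_stripped("schön"))  # schon schon  (German: "already" vs "beautiful")
    print(uncased_stripped("byt"), uncased_stripped("být"))      # byt byt      (Czech: "flat" vs "to be")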
The model is indeed so overpowered that it doesn't matter in practice. See the SentencePiece paper for some discussion of the design decisions on stuff like whitespace.