Interestingly, they seem to have different token IDs for "Word", "word", " Word" and " word". That seems like kind of a wasteful design.
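You can check this yourself with OpenAI's tiktoken library; here's a quick sketch using the cl100k_base encoding (picked as an example, other encodings behave similarly):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # example GPT-3.5/4-era encoding

    for variant in ["Word", "word", " Word", " word"]:
        print(repr(variant), "->", enc.encode(variant))
    # Each variant encodes to a different token id (or id sequence).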
It seems like it would make more sense to have a single token for all variants and then a "capitalized where not expected" token (e.g. "foo Foo"), a "not capitalized where expected" token (e.g. "foo. foo") and a "missing space where expected" token (e.g. "foo.Foo").
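Purely as illustration (the marker names and the regex heuristics here are made up, not anything a real tokenizer uses), a normalization pass along those lines might look like:

    import re

    # Hypothetical marker tokens for the proposed scheme.
    CAP_UNEXPECTED = "<cap>"      # capitalized where not expected, e.g. "foo Foo"
    NOCAP_EXPECTED = "<nocap>"    # lowercase where a capital is expected, e.g. "foo. foo"
    NO_SPACE       = "<nospace>"  # missing space after punctuation, e.g. "foo.Foo"

    def normalize(text: str) -> str:
        out = []
        expect_capital = False  # be lenient about the very first word
        for match in re.finditer(r"[A-Za-z]+|[.!?]", text):
            tok = match.group()
            if tok in ".!?":
                out.append(tok)
                expect_capital = True
                continue
            # Missing space: previous character is punctuation glued to this word.
            if match.start() > 0 and text[match.start() - 1] in ".!?":
                out.append(NO_SPACE)
            if tok[0].isupper() and not expect_capital:
                out.append(CAP_UNEXPECTED)
            elif tok[0].islower() and expect_capital:
                out.append(NOCAP_EXPECTED)
            out.append(tok.lower())
            expect_capital = False
        return " ".join(out)

    print(normalize("foo Foo"))   # foo <cap> foo
    print(normalize("foo. foo"))  # foo . <nocap> foo
    print(normalize("foo.Foo"))   # foo . <nospace> foo

The downstream vocabulary would then only need the lowercase variants plus a handful of marker tokens.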
The lack of any normalization also means that WrItInG tExT lIkE tHiS will prevent future GPT versions from making full use of the text during training, unless they change the tokenization (or the model is so overpowered that it doesn't matter).
The tokenization is a statistical product of the frequency of byte sequences in the training corpus. It might seem unintuitive, but I wouldn't go so far as to say it's "wasteful". It may very well be, but frankly you'd need a good explanation for why byte pair encoding is so much more successful than other tokenization schemes.
> why byte pair encoding is so much more successful than other tokenization schemes.
What's the evidence for that, please? Just asking because I don't know, not because I disagree. I've read a bunch of BPE explainers, but nobody has bothered to explain why or how we landed on BPE.
I'm not an AI expert, so I don't know what research has been done to verify it, but this comment below, https://news.ycombinator.com/item?id=35454839, helped me understand it, and intuitively I think it makes sense.
That is, byte pair encoding tokenization is itself based on how often particular characters appear next to each other in the training data. So if the training data very frequently sees certain characters together (as it does, of course, in common words), those words end up as a single token. That makes sense given how an LLM works, because it looks for statistical relationships among strings of tokens. The way I think of it is that byte pair encoding is essentially a pre-processing step that already optimizes for statistical relationships among individual characters.
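A toy version of the merge loop makes this concrete. Real tokenizers (GPT's included) work on raw bytes over a huge corpus and keep the learned merges as the vocabulary, but the core mechanism is just this (a sketch, not any production implementation):

    from collections import Counter

    def train_bpe(corpus: str, num_merges: int):
        # Start from individual characters; repeatedly merge the most
        # frequent adjacent pair into a new, longer symbol.
        symbols = list(corpus)
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(symbols, symbols[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            merges.append(a + b)
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return merges, symbols

    merges, symbols = train_bpe("the cat sat on the mat the cat sat", num_merges=8)
    print(merges)   # frequent fragments like "at" merge first and grow into longer chunks
    print(symbols)

Frequent words end up as single symbols simply because their character pairs keep winning the merge contest.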
The actual tokenizer often does not matter, since you can add preprocessors/normalizers. I assume they did it like this because capitalization matters in a lot of contexts.
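For example, the Hugging Face tokenizers library lets you chain normalizers in front of the model (whether you should throw that information away is exactly the debate here):

    # pip install tokenizers
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # A chain similar in spirit to what "uncased" BERT models did:
    # unicode-decompose, drop accents, fold case.
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    print(normalizer.normalize_str("Héllò hôw are ü?"))  # "hello how are u?"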
Similarly, pre-processing can be harmful. I think there are reasonable predictive differences when predicting the next-word follow-up to a sentence that's properly capitalized versus one that's all lowercase. Not only will the "all lowercase" convention likely prevail in forward predictions, it also indicates something about the context of the writing, the author, and their sense of style.
It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.
A good example is the initially released BERT-multilingual-uncased model from the first BERT paper, which (without even mentioning it anywhere) not only collapsed case but also removed diacritic marks from Latin characters, killing its performance on languages that rely heavily on them.
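To make the failure mode concrete, here's roughly what such an "uncased + accent-stripping" pipeline does (my own sketch with arbitrarily chosen example words, not the actual BERT preprocessing code):

    import unicodedata

    def uncased_stripped(text: str) -> str:
        # Lowercase, unicode-decompose, then drop combining marks (diacritics).
        decomposed = unicodedata.normalize("NFD", text.lower())
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

    # Distinct words collapse to the same string, so the model can no longer tell them apart.
    print(uncased_stripped("schon"), uncased_stripped("schön"))  # schon schon  (German: "already" vs "beautiful")
    print(uncased_stripped("byt"), uncased_stripped("být"))      # byt byt      (Czech: "flat" vs "to be")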
The model is indeed so overpowered that it doesn't matter in practice. See the SentencePiece paper for some discussion of the design decisions on stuff like whitespace.