
Interestingly, they seem to have different token IDs for "Word", "word", " Word" and " word". That seems like kind of a wasteful design.

It seems like it would make more sense to have a single token for all variants and then a "capitalized where not expected" token (e.g. "foo Foo"), a "not capitalized where expected" token (e.g. "foo. foo") and a "missing space where expected" token (e.g. "foo.Foo").

The lack of any normalization also means that WrItInG tExT lIkE tHiS will keep future GPT versions from making full use of the text during training, unless they change the tokenization (or the model is so overpowered that it doesn't matter).
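
For illustration, you can check this with the open-source tiktoken library (the snippet uses the GPT-2 encoding; exact IDs differ per model, but the pattern is the same):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")  # GPT-2/GPT-3 byte-pair encoding

    for s in ["Word", "word", " Word", " word", "WrItInG tExT lIkE tHiS"]:
        print(repr(s), "->", enc.encode(s))

The first four variants each come back with distinct token IDs, and the mixed-case string shatters into far more tokens than its lowercase equivalent would.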




The tokenization is a statistical product of the frequency of byte sequences in the training corpus. It might seem unintuitive, but I wouldn't go so far as to say it's "wasteful". It may very well be, but frankly you'd have to have a good explanation for why byte pair encoding is so much more successful than other tokenization schemes.
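
Concretely, the statistic those merges are built on is just how often adjacent byte pairs co-occur in the corpus. A toy count over a made-up miniature corpus:

    from collections import Counter

    corpus = b"the theme of the thesis is the theory"
    pair_counts = Counter(zip(corpus, corpus[1:]))   # adjacent byte pairs

    for (a, b), n in pair_counts.most_common(3):
        print(bytes([a, b]), n)

Pairs like b'th' and b'he' dominate, so they become the first merge candidates.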


> why byte pair encoding is so much more successful than other tokenization schemes.

What's the evidence for that, please? Just asking because I don't know, not because I disagree. I've read a bunch of BPE explainers, but nobody has bothered to explain why or how we landed on BPE.


I'm not an AI expert, so I don't know what research has been done to verify it, but this comment below (https://news.ycombinator.com/item?id=35454839) helped me understand it, and intuitively I think it makes sense.

That is, byte pair encoding tokenization is itself based on how often particular characters appear next to each other in the training data. If characters appear together very frequently (as they do, of course, in common words), then those words get a single token. That makes sense given how an LLM works, because it looks for statistical relationships among strings of tokens. So the way I think of it is that byte pair encoding is essentially a pre-processing step that already optimizes for statistical relationships among individual characters.
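
A minimal sketch of that training loop (not OpenAI's implementation, just the idea): repeatedly count adjacent symbol pairs and merge the most frequent one into a new token.

    from collections import Counter

    def train_bpe(text, num_merges):
        seq = [bytes([b]) for b in text.encode("utf-8")]   # start from raw bytes
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merges.append((a, b))
            # replace every occurrence of the winning pair with the merged symbol
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return merges, seq

    merges, tokens = train_bpe("the theme of the thesis is the theory", 10)
    print(merges)   # the most frequent byte pairs get merged first
    print(tokens)   # the text re-segmented with the learned merges

Frequent words end up as single tokens simply because their byte sequences win the merge race.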


In practice, GPT uses byte-pair encoding [0], which builds its tokens up from individual Unicode characters.

That’s why cases are treated differently: they’re different in Unicode.

This is also the only way to teach a model how to properly capitalize things (since there are no human-defined rules).
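
As a quick illustration, the upper- and lower-case variants really are different bytes, so a byte-level BPE never has a chance to treat them as the same symbol:

    for s in ["Word", "word", " Word", " word"]:
        print(repr(s), "->", list(s.encode("utf-8")))

    # 'Word'  -> [87, 111, 114, 100]
    # 'word'  -> [119, 111, 114, 100]
    # ' Word' -> [32, 87, 111, 114, 100]
    # ' word' -> [32, 119, 111, 114, 100]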

[0] https://towardsdatascience.com/byte-pair-encoding-subword-ba....


The actual tokenizer often does not matter, since you can add pre-processors/normalizers. I assume they did it like this because capitalization matters in a lot of contexts.
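
For example, the Hugging Face tokenizers library lets you bolt a normalizer onto the front of any tokenizer; a case/accent-folding pre-processor (which GPT's tokenizer doesn't do) looks roughly like this:

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, Lowercase, StripAccents

    normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
    print(normalizer.normalize_str("Héllò HÔW Are Ü?"))   # -> "hello how are u?"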


Similarly, pre-processing can be harmful. I think there are real differences in predicting the next word after a sentence that's properly capitalized versus one that's all lowercase. Not only will the "all lowercase" convention likely carry over into forward predictions, it also indicates something about the context of the writing, the author, and their sense of style.

It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.


> Similarly, pre-processing can be harmful.

A good example is the initially released BERT-multilingual-uncased model from the first BERT paper, which (without even mentioning it anywhere) not only collapsed case but also removed diacritic marks from Latin characters, killing its performance on the languages that rely heavily on them.
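
That normalization amounts to roughly the following (approximated here with the standard library), which throws away a lot of signal in a language like Czech:

    import unicodedata

    def uncased_strip_accents(text):
        text = unicodedata.normalize("NFD", text.lower())   # split base chars from accents
        return "".join(c for c in text if not unicodedata.combining(c))

    print(uncased_strip_accents("Příliš žluťoučký kůň"))   # -> "prilis zlutoucky kun"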


The model is indeed so overpowered that it doesn’t matter in practice. See the SentencePiece paper for some discussion of the design decisions on things like whitespace.
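
For the curious: SentencePiece's answer on whitespace is to escape it into a visible meta symbol ("▁", U+2581) so encoding stays lossless and reversible. A toy training run (tiny made-up corpus, so the exact splits will vary):

    import sentencepiece as spm

    with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
        f.write("Hello world\nhello World\nHELLO WORLD\n" * 200)

    spm.SentencePieceTrainer.train(input="tiny_corpus.txt", model_prefix="tiny",
                                   vocab_size=30, model_type="bpe")
    sp = spm.SentencePieceProcessor(model_file="tiny.model")
    print(sp.encode("Hello world", out_type=str))   # pieces carry the "▁" space marker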


Not all languages use capitalization the same way (or have it at all) and not all LLM input/output is natural language.


I don't think it's wasteful. If I ask GPT to process or generate a non-human language like Linux shell commands, capitalization is crucial...


It’s not surprising or bad design at all. Words mean different things depending on context, punctuation, etc.


I wonder if this is why writing things in all caps in ChatGPT sometimes has an effect on the response.


I am glad it tokenizes Python and all other programming languages in a systematic way.


They charge by the token, so I’m not so sure about that.



