> but the current tokenization merges digits according to their frequency
Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.
Any idea why this tokenization was used for digits? I understand that staying blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but for numbers it seems awful. Whatever density benefit it buys in the context window seems worthless compared to how much harder it makes understanding what the numbers mean.
The simple answer is that the same subword tokenization algorithm is used for everything: all symbols of all languages in all alphabets, all domains (books, tweets, code, etc.), and all other symbols like emoji (including combining characters) and punctuation. If you were optimizing for digit-specific tasks, it would make sense to treat digits specially, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't - you can try it out here: https://platform.openai.com/tokenizer . And it kind of makes sense, because in the actual usage seen in training data, IMHO digits are mostly not representing decimal integers for math; they show up as phone numbers, as components of identifiers like "GPT-3", as parts of email addresses, things like that which are more common in textual data than math.
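If you want to see the merges directly, here's a minimal sketch using the tiktoken library (assuming `pip install tiktoken` and the "gpt2" encoding, which is what GPT-2/GPT-3 used); it reproduces what that web page shows:

```python
# Rough sketch: inspect how a frequency-learned BPE vocabulary splits digit strings.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary used by GPT-2/GPT-3

for s in ["1234567890", "GPT-3", "777"]:
    tokens = enc.encode(s)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]
    print(f"{s!r} -> {pieces}")

# The digit boundaries fall wherever the learned merges happen to land,
# e.g. "1234567890" comes out as a few multi-digit chunks rather than
# ten single-digit tokens (exact splits depend on the vocabulary).
```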
I dunno. Sometimes a group of numbers has a non-mathematical semantic meaning that maps well onto multi-digit tokens -- like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to shrink the input sequence for a given amount of input meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.
My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.
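For a concrete count (same tiktoken sketch as above; the number below is a made-up example, and exact counts vary by vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
number = "555-867-5309"  # hypothetical phone number, 12 characters
print(len(number), "characters ->", len(enc.encode(number)), "tokens")
```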