> but the current tokenization merges digits according to their frequency
Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.
Any idea why this tokenization was used for digits? I understand that staying blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but for numbers it seems awful. Whatever density benefit it buys in the context window seems worthless compared to how much harder it makes understanding what the numbers mean.
The simple answer is that the same subword tokenization algorithm is used for everything: all symbols of all languages in all alphabets, all domains (books, tweets, code, etc.), and all other symbols like emoji (including combining characters) and punctuation. If you were optimizing for digit-specific tasks, it would make sense to treat digits specially, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't - you can try it out here: https://platform.openai.com/tokenizer . And it kind of makes sense, because in the actual usage seen in training data, IMHO digits are mostly not representing decimal integers for math; they show up as phone numbers, as components of identifiers like "GPT-3", as parts of email addresses, things like that which are more common in textual data than math.
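If you want to see the merges directly, here's a minimal sketch using the tiktoken library (assuming `pip install tiktoken` and the "gpt2" encoding, which is what GPT-2/GPT-3 used); it reproduces what that web page shows:

```python
# Rough sketch: inspect how a frequency-learned BPE vocabulary splits digit strings.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary used by GPT-2/GPT-3

for s in ["1234567890", "GPT-3", "777"]:
    tokens = enc.encode(s)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]
    print(f"{s!r} -> {pieces}")

# The digit boundaries fall wherever the learned merges happen to land,
# e.g. "1234567890" comes out as a few multi-digit chunks rather than
# ten single-digit tokens (exact splits depend on the vocabulary).
```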
I dunno. Sometimes a group of numbers has a non-mathematical semantic meaning that maps well onto multi-digit tokens -- like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to shrink the input sequence for a given amount of input meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.
My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.
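For a concrete count (same tiktoken sketch as above; the number below is a made-up example, and exact counts vary by vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
number = "555-867-5309"  # hypothetical phone number, 12 characters
print(len(number), "characters ->", len(enc.encode(number)), "tokens")
```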