
It's not; it wouldn't make sense to have 100,000 tokens just for the first 100,000 numbers. There's a playground [1] where you can see how various LLMs tokenize a string.

12345678987654321 is tokenized on various models like so:

  GPT4                123-456-789-876-543-21
  GPT3                123-45-678-98-765-43-21
  Llama-2, Mistral    1-2-3-4-5-6-7-8-9-8-7-6-5-4-3-2-1
[1] https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
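A minimal sketch of reproducing the GPT-4 and GPT-3 rows above with the tiktoken library (cl100k_base and r50k_base are the encodings OpenAI documents for those model families; the exact splits you see depend on the encoding version):

  import tiktoken

  text = "12345678987654321"

  for name in ("cl100k_base", "r50k_base"):  # GPT-4, GPT-3 encodings
      enc = tiktoken.get_encoding(name)
      ids = enc.encode(text)
      pieces = [enc.decode([i]) for i in ids]  # decode each token id on its own
      print(f"{name}: {'-'.join(pieces)}")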


Looks like number-string parsing may be important enough to warrant look-ahead recursive sub-parsing of candidate tokenizations, then using the most "promising" one: the tokenization that yields the highest-probability tree following that number string. See the sketch below.
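A rough illustration of the idea, not anything a current model does: enumerate candidate segmentations of the digit string whose chunks are all single tokens in the vocabulary, then pick the "best" one. A real version would score each candidate by the probability the model assigns to what follows; here a trivial fewest-tokens heuristic stands in for that score.

  # Illustrative sketch only; the scoring rule is a stand-in.
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

  def segmentations(digits, max_chunk=3):
      """Yield all ways to split `digits` into chunks of 1..max_chunk digits."""
      if not digits:
          yield []
          return
      for k in range(1, min(max_chunk, len(digits)) + 1):
          for rest in segmentations(digits[k:], max_chunk):
              yield [digits[:k]] + rest

  def is_single_token(chunk):
      return len(enc.encode(chunk)) == 1

  number = "12345678987654321"
  candidates = [seg for seg in segmentations(number)
                if all(map(is_single_token, seg))]

  # Stand-in "promise" score: prefer the segmentation with the fewest tokens.
  best = min(candidates, key=len)
  print("-".join(best))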



