Hi folks – I work at OpenAI and helped build this page, awesome to see it on here!
Heads up that it's a bit out of date, as GPT-4 has a different tokenizer than GPT-3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app)
I wasn't aware that GPT-3 and GPT-4 use different tokenizers. I've read https://github.com/openai/openai-cookbook/blob/main/examples... and misinterpreted "ChatGPT models like gpt-3.5-turbo and gpt-4 use tokens in the same way as older completions models, ..." as GPT-3 and GPT-4 using the same tokenizer except for im_ tokens. Now I can see so many improvements, including the encoding of whitespaces and digits.
Hey it seems that UTF-8 support is broken on the page.
A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is beautiful and amazing" in Russian).
My assumption is that it's the implementation on the page that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I guess wouldn't be the case with tokenization like what's presented on the page.
Author here, you are correct! The issue is due to the fact that a single user-perceived character might span multiple tokens. This should be fixed now.
Are there plans to release tokenisers for other platforms? I'm accessing the OpenAI API from Clojure, and it would be really nice to have a JVM version so I can estimate token use before sending.
That is very helpful, thank you. I had not realised the latest models now tokenize numbers as 3-digit groups. Can you give any insight into why 3 digits?
This tool is really useful for helping develop a better intuition for how GPT models actually work.
Paste in some text and switch to the token IDs view. Note how common words (like "the ") have low integer token IDs, while things like emojis are split into several numbers.
An LLM is a function that takes an array of integers and returns a new array of integers. Seeing the tokens like this helped me reinforce that mental model.
> An LLM is a function that takes an array of integers and returns a new array of integers.
To refine this a bit more, an LLM is a function that takes an array of integers (or really, a batch of arrays of integers) and returns a probability distribution over every possible integer at each position, with the outputs shifted left by one place relative to the inputs to enable next-token prediction.
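To make the shapes concrete, here's a toy sketch (using numpy with a random stand-in for the model, so the numbers are meaningless; only the shapes matter):

    import numpy as np

    vocab_size = 50257            # GPT-2/GPT-3 BPE vocabulary size
    batch, seq_len = 2, 5
    token_ids = np.random.randint(0, vocab_size, size=(batch, seq_len))

    def fake_llm(ids):
        # A real model would compute these logits from the ids; here they're random.
        return np.random.randn(ids.shape[0], ids.shape[1], vocab_size)

    logits = fake_llm(token_ids)                                    # (batch, seq_len, vocab)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax per position
    print(probs.shape)  # (2, 5, 50257)

    # probs[b, t] is the predicted distribution for the token at position t+1,
    # i.e. the targets are the inputs shifted left by one place.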
I've always wondered how stop tokens fit in here. Does the LLM generate a probability for "stop" in addition to every other token in the space? Or is stopping handled heuristically by the outer loop that generates the output tokens sequentially?
The API docs talk about letting you specify your own stop token (like "<!-->") but I don't think "token" is meant in the same sense here.
Yes, the model has something like an end-of-text (EOS) token which it emits when the output should end. It is part of the probability distribution that the model predicts.
Yes! This is something that is done. The problem is that a) it’s tough to find a sane denominator as the likelihood of the entire sequence can be quite small, even though it’s the best answer and b) the answer isn’t grounded in anything, so the confidence score isn’t super helpful.
A score like this can be useful for active learning though, where you find areas of low confidence in your dataset and get more data to train on.
Yes, one distribution per position. This was a key innovation that allowed training over the entire sequence in parallel, rather than on one token prediction at a time, thereby massively speeding up training.
More recently, there are models like RWKV that can run in both parallel (GPT-like) mode for training and serial (RNN-like) mode for inference.
But transformers always output a probability distribution at each position in the context.
You unfold it one token at a time by sampling from the returned distribution. To control the amount of variation, you can make the probability distribution more extreme; in the most extreme case you always select the most likely token, and the sequence becomes deterministic.
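A minimal sketch of that last step, assuming you already have the logits for one position (the `temperature` knob is what makes the distribution more or less extreme; 0 means greedy and deterministic):

    import numpy as np

    def sample(logits, temperature=1.0, rng=np.random.default_rng()):
        if temperature == 0.0:
            return int(np.argmax(logits))              # greedy: always the most likely token
        scaled = logits / temperature                  # lower temperature -> more extreme
        scaled = scaled - scaled.max()                 # for numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return int(rng.choice(len(probs), p=probs))    # stochastic: varied output

    logits = np.array([2.0, 1.0, 0.1])
    print(sample(logits, temperature=0.0))   # always index 0
    print(sample(logits, temperature=1.5))   # occasionally index 1 or 2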
Yes, what happens later in the sentence depends on the particular choice you made earlier in the sentence.
> "Antidepressants" I'd imagine tokenizes as "anti" "depress" "ant". But nope. And "antipsychotic" tokenizes differently from it too..
Tokens are symbols. You're thinking of them like embedding vectors. Tokens represent the step before a meaning is assigned to the text: it turns some unit of text into what's essentially an identifier.
Which is to say, two homonyms would have the same token id, even though they have different meanings. Tokens have no notion of context.
You could split on words instead of tokens, but then you need a large vocabulary, you can't deal with inputs that contain a word which is not in the vocabulary, and it's not so clear what a "word" even is.
Instead of coming up with more and more heuristics to chop a sequence of bytes up in "words" in a vocabulary, we could simply set a limit on the size of the vocabulary (number of tokens), put all bytes in there (so we can at least handle any input byte by byte), and pack the remaining space with the most common multi-byte byte sequences. Then you end up with tokens like here.
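A toy sketch of that procedure (greatly simplified; real tokenizers like GPT-2's BPE add regex pre-splitting and other details, so don't expect identical merges):

    from collections import Counter

    def train_bpe(text, vocab_size):
        tokens = list(text.encode("utf-8"))      # start from the 256 possible byte values
        merges, next_id = {}, 256
        while next_id < vocab_size:
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]    # most frequent adjacent pair
            merges[best] = next_id
            merged, i = [], 0                    # replace every occurrence of the pair
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(next_id)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens, next_id = merged, next_id + 1
        return merges

    print(train_bpe("the cat sat on the mat, the fat cat", vocab_size=260))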
There is no 'meaning' inside these AIs. It's terribly confusing to think about these LLMs as having 'meaning' in the same way we humans do. It's all just statistics: given a sequence of numbers (each representing some abstract token), what is most likely to come next? That's how 'simple' it is. It's also what makes it so amazing that these things work as well as they do. I giggle like a schoolgirl every time I get it to add some functionality to a function, or write an entire new function, and that's several times a day for what is now months on end. But the key to using them is seeing that there is no 'meaning' in them. It's all just streams of (to the machine) meaningless tokens.
There’s no meaning to the tokens, but research has shown that the models themselves capture meaning. Technically they are producing the next word but in order to do that for a dataset of a trillion words they actually have to develop internal models of how the world works. There was a post on HN a couple days ago that talked about the research done to show this.
You say that but we have models of meaning in humans too.
You can put people in an fMRI and ask them to think "car".
You can ask someone to think of objects and detect when they think "car".
What happened there is pairing a bunch of tensors to meanings and matching them.
We can do something similar with embeddings.
To be clear I don't intend to give the impression that these LLMs are doing something miraculous. Just that we are increasingly peeling back the veil of how brains think.
> You can put people in an fMRI and ask them to think "car".
I don't know about other people, but when I think “car” really hard, I can feel the muscles in my throat adjust slightly to match the sound of the word “car”. Perhaps that sort of thing is what the MRI machine is picking up, rather than it picking up some kind of "internal representation" of a car.
In fact it also picks up the parts of your brain to do with driving (if you're a driver). Maybe also the part to do with the smell of fuel in me, but not you.
It'll also light up the parts of my brain to do with reading, writing, and hearing the word in the languages I speak.
What does car mean to me if it doesn't connect to all the concepts that relate to cars?
If it just decides on a single token at a time, can it backtrack and choose differently under that operation, given the next tokens? What I wonder is, how can it plan ahead and output meaningful (to us) responses, like working code or useful articles? How can it "reason" logically when it needs to solve a problem, a riddle etc, by only selecting a token at a time? Wouldn't that dumbed down approach prove myopic for complex compositions? Doesn't it need some over-ruling goal-based heuristic system?
There’s no planning, no reason. It’s all ‘what word is next…’
I found Stephen Wolfram's explanation helpful. He has a YouTube video version, which I enjoyed too.
This blog post was on HN last month, but I never get good search results on hn
If we get a bit quantum (or invoke an act of God, for some), then backtracking could happen by collapsing the dead ends and "changing" history to stay with what turns out to be the solid plan. Could emergent consciousness in the AI's neurons do the planning and reasoning that it rather seems to be doing, even though ML experts will say it is not? If our consciousness could conceivably reside somewhere other than in the electrical currents of the wetware, could the AI's reasoning likewise reside somewhere other than in the tokens? Is there some mysterious process possibly taking place?
It is wild that a process like that can generate working code. Humans speak their words in order, but they don't write their code in order. Why would writing code in order work?
There's no "better way" to do it, because the tokens are all meaningless to ChatGPT; it only cares about how efficiently they can be parsed and processed.
The competing desires are to model all language with the biggest tokens possible and with the fewest tokens possible. The split points aren't meaningless: text is split into the largest possible chunks using a set of the most common tokens.
Common words, like "the", "fast", "unity", "flying" are all tokens, but it's not because they're words, it's because they're common letter clusters, undistinguished from "fl", "ing", "un", "ple"
"gadflying" is tokenized into [g, ad, flying], even though it's only loosely semantically related to "flying", it's just the most efficient way to tokenize it.
1. Greatly reduces memory usage. Instead of memorizing every inflection of the word "walk", it memorizes the root (walk) and the modifiers (ing, ed, er, ...). These modifiers can be reused for other words.
2. Allows for word compositions that weren't in the training set.
This is great for uncommon or new expressions like "googlification" or "unalive".
There’s no syntax or structure to the token set. The actual tokens were algorithmically selected based on the training data to (putting things loosely) optimize compression of the training data given a token set size.
Sure, but what I'm hearing in the parent post is a question about why we don't use linguistically motivated subword units (of similar length/vocabulary size and thus memory usage) e.g. cutting across morpheme boundaries instead of whatever an algorithm like BPE calculates.
Imagine a circular list (in however you want to construct that) that matches the input size for the model.
The prompt is initially loaded at the start of the list, the model is run, and it produces high activation on a single output. That output token is then appended to the end of the circular input list and also appended to the "this is what the model returned" string.
This process of running the model, getting the output token, and sending one copy to the input list and one copy to the return string is repeated until the number of tokens generated hits a numeric limit or the stop token is encountered.
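In code, that loop looks roughly like this (a sketch; `toy_model` and `toy_sample` are made-up stand-ins for the real forward pass and sampler):

    import numpy as np

    def generate(model, sample, prompt_ids, eos_id, max_new_tokens=100):
        context, output = list(prompt_ids), []
        for _ in range(max_new_tokens):
            logits = model(context)        # scores over the whole vocabulary
            next_id = sample(logits)       # pick one token (greedy, temperature, ...)
            if next_id == eos_id:          # the stop token is just another token id
                break
            context.append(next_id)        # feed the choice back in as input
            output.append(next_id)         # ...and record it for the caller
        return output

    vocab = 10
    toy_model = lambda ids: np.random.randn(vocab)
    toy_sample = lambda logits: int(np.argmax(logits))
    print(generate(toy_model, toy_sample, prompt_ids=[1, 2, 3], eos_id=0))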
It's a lot simpler than that. You can see in the tokenizer that the boundary for words includes the preceding space. So, since the first word doesn't have a preceding space, it has a different token.
I found this tool recently when it was linked from this Computerphile video[1] about "glitch tokens".
tldw:
Certain junk data (e.g. the /r/counting[2] community data and debug logs from Rocket League) was thrown out after the tokenizer had already been trained on it.
Some tokens specific to those contexts stuck around, however, and are now like "a color you've never seen before" as far as GPT-X models are concerned.
Giving the model one of these "glitch" tokens causes it to kind of freak out and return gibberish or some completely random response, because it never encountered them during training; they were removed when the data was cleaned.
Byte pair encoding, by construction, is quadratic in the length of the words.
And usually the input is pre-split into words before being given to the byte pair encoder.
Hopefully they use a different implementation in prod.
It needs to be sanitized against very long words (like 10k character long words :) ).
In previous tokenizers like CLIP's (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz... ), they used additional preprocessing steps like HTML escaping and various cleanup passes using some Python libraries (ftfy, html and regex), which made porting the code exactly to other languages a real pain.
In theory Byte Pair Encoding is unique, but practice makes it harder. It's also complicated due to regex and UTF-8. Most of the time the differences shouldn't be too important, because the neural network should be able to handle typos.
In BPE you may have plenty of escaping problems; problematic characters like ' and \ are nasty to get right. The worst case, if you don't handle your errors: if you trained your byte pair encoding dictionary on escaped sentences, then a single \ should never occur, since it is always encoded as \\, so if you split the string in the middle of the \\ the byte pair encoder might fail to find the key in the dictionary.
Making the thing deterministic and stable across regex versions is also hard (once you've trained a network, you'd rather not have to retrain it because of a bugfix in a regex library). Porting to other platforms also becomes very hard if you want replicable results.
We noticed that this webpage is out of date for recent models so we (diagram.com) commissioned a better one that lets you pick any of OpenAI's models including chats:
Wow, I can't thank you enough for this. Somehow I never noticed that GPT-4 doesn't use separate tokens for each tab like 3.5. I was wasting a lot of time minimizing excessively tabbed code to save on the token count! Like seriously way too much time, all based on a bad assumption.
Interestingly they seem to have different token ids for "Word", "word", " Word" and " word". That seems kind of a wasteful design.
It seems like it would make more sense to have a single token for all variants and then a "capitalized where not expected" token (e.g. "foo Foo"), a "not capitalized where expected" token (e.g. "foo. foo") and a "missing space where expected" token (e.g. "foo.Foo").
The lack of any normalization also means that WrItInG tExT lIkE tHiS will make future GPT versions not be able to make full use of the text during future training unless they change the tokenization (or the model is so overpowered that it doesn't matter).
The tokenization is a statistical product of the frequency of byte sequences in the training corpus. It might seem unintuitive but I wouldn't go so far as to say it's "wasteful". It may very well be but frankly you'd have to have a good explanation for why byte pair encoding is so much more successful than other tokenization schemes.
> why byte pair encoding is so much more successful than other tokenization schemes.
What's the evidence for that, please? Just asking because I don't know, not because I disagree. I've read a bunch of BPE explainers, but nobody has bothered to explain why or how we landed on BPE.
I'm not an AI expert, so I don't know what research has been done to verify it, but this comment below, https://news.ycombinator.com/item?id=35454839 , helped me understand it, and intuitively I think it makes sense.
That is, byte pair encoding tokenization is itself based on how often particular characters are seen next to each other in the training data. Thus, if certain characters appear together really frequently (as, of course, they do in common words), then those words get a single token. Which, given how an LLM works, really makes sense, because it looks for statistical relationships among strings of tokens. Thus, the way I think of it is that byte pair encoding is essentially a pre-processing step that already optimizes for statistical relationships among individual characters.
The actual tokenizer often does not matter, since you can add pre-processors/normalizers. I assume they did it like this because capitalization matters in a lot of contexts.
Similarly, pre-processing can be harmful. I think there are reasonable predictive differences when predicting the next-word follow up to a sentence that's properly capitalized versus one that's all lowercase. Not only will the "all lowercase" convention likely prevail in forward predictions, it also indicates something about the context of the writing, the author, their sense of style.
It's hard to argue that this information isn't (a) being captured by GPTs and (b) important. If you just threw it away, GPTs would have less information available to absorb.
A good example is the initially released BERT-multilingual-uncased model back from the first BERT paper, which (without even mentioning it anywhere) not only collapsed the case but also removed diacritic marks from latin characters, thus killing its performance on those languages which heavily rely on them.
The model is indeed so overpowered that it doesn’t matter in practice. See the Sentencepiece paper for some discussion of the design decisions on stuff like whitespace.
> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly 3/4 of a word (so 100 tokens ~= 75 words).
Just for fun I tried entering in "pneumonoultramicroscopicsilicovolcanoconiosis" and "antidisestablishmentarianism". The first was pretty evenly split into tokens of length 1-5 characters, but the second put all of "establishment" into a single token.
No useful conclusions drawn, but it was an interesting test.
I desperately want to be able to get a concrete amount of tokens for my prompt before making a call - things like this make it very hard to request the right amount of max_tokens from longer prompt/generation pairs.
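If you're calling the API from Python, tiktoken gives you the prompt count up front (a rough sketch; for chat models the API adds a few extra framing tokens per message, so treat it as an estimate):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")   # picks the matching encoding
    prompt = "How many tokens will this prompt use?"
    n_prompt_tokens = len(enc.encode(prompt))

    # max_tokens only covers the completion, so budget roughly:
    context_window = 4096                                # e.g. the original gpt-3.5-turbo
    max_completion = context_window - n_prompt_tokens
    print(n_prompt_tokens, max_completion)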
tl;dr: start with single characters and greedily merge the pairs that are the most frequent.
A consequence is that an encoding is suited to the dataset it was trained on, so if a language is under-represented in the data, it will take a higher number of tokens to encode it.
> The main difference to other compression algorithms, such as Huffman encoding, which have been proposed to produce a variable-length encoding of words for NMT (Chitnis and DeNero, 2015), is that our symbol sequences are still interpretable as subword units, and that the network can generalize to translate and produce new words (unseen at training time) on the basis of these subword units.
I don't see why Huffman encoding doesn't give you that same interpretability?
Actually the algorithm for producing a Huffman tree is very similar to that for BPE:
> The process begins with the leaf nodes containing the probabilities of the symbol they represent. Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weight of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., we exclude the two leaf nodes), we repeat this process until only one node remains, which is the root
Heh. All I know is this is a fun magic token, but 1) I don't really know how they found this and 2) I don't know what its implications are. I heard that you can use it to detect if you are talking to an AI.
I think it's related to Reddit users who posted (very frequently!) on a counting focused subreddit (people literally post "1", "2" , "3" in sequence so usernames appear 50k+ times). Some screenshots and links in this Twitter thread: https://twitter.com/SoC_trilogy/status/1623118034960322560
"They" as in OpenAI, when they trained the tokenizer, just dumped a big set of text data into a BPE (byte pair encoding) tokenizer training script, and it saw that string in the data so many times that it ended up making a token for it.
"They" as in the rest of us afterward... probably just looked at the token list. It's a little over fifty thousand items, mostly short words and fragments of words, and can be fun to explore.
The GPT-2 and GPT-3 models proper were trained on different data than the tokenizer they use, one of the major differences being that some strings (like " SolidGoldMagikarp") showed up very rarely in the data that the model saw. As a result, the models can respond to the tokens for those strings a bit strangely, which is why they're called "glitch tokens". From what I've seen, the base models tend to just act as if the glitch token wasn't there, but instruction-tuned models can act in weirdly deranged ways upon seeing them.
The lesson to learn overall AIUI is just that you should train your tokenizer and model on the same data. But (also AIUI - we don't know what OpenAI actually did) you can also simply just remove the glitch tokens from your tokenizer, and it'll just encode the string into a few more tokens afterward. The model won't ever have seen that specific sequence, but it'll at least be familiar with all the tokens in it, and unlike never-before-seen single tokens, it's quite used to dealing with never-before-seen sentences.
It doesn't necessarily mean it scraped twitch chat. That is the name of a moderator. They also moderate the subreddit and probably some other places. And being a moderator for such a popular event they probably had their name mentioned in other places as well. Every time they comment on Reddit their username would also appear.
Tiktoken is pretty nice. I've been exposing it as an internal service in our infrastructure so that we can get token counts easily. The bigger problem is figuring out how to chunk longer contexts so that you stay within the context window limit defined by the model you are using.
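One naive way to do the chunking, sketched with tiktoken (splitting on raw token boundaries can cut a word or even a multi-byte character in half, so real pipelines usually split on paragraphs or sentences and then check the token count):

    import tiktoken

    def chunk_by_tokens(text, max_tokens, encoding_name="cl100k_base"):
        enc = tiktoken.get_encoding(encoding_name)
        ids = enc.encode(text)
        return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

    chunks = chunk_by_tokens("some very long document ... " * 1000, max_tokens=1000)
    print(len(chunks), "chunks")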
It completely butchers Greek. No wonder it charges so much for so little output. Every Greek character is a token.
I wonder if there is space for innovation there. I would imagine that it is similarly difficult for other non-English languages as well. I fear for the effect this will have on them.
It's crazy that OpenAI hasn't fixed their tokenizer yet. They are leaving the door wide open for some Chinese big tech company to capture the non-Latin script parts of the world.
i18n (and accessibility) was something American tech companies were serious about in the 90s and early 2000s. That is how they captured most of the global market. US tech dropping the ball on this leaves the door wide open for Chinese competitors.
Does OpenAI’s tokenizer issues cash out into having worse results for Greek rather than just being more expensive for gpt-4? (gpt-3.5-turbo already costs peanuts)
If not, then this response seems overblown. The competitive advantage in LLM at this point probably is not tokenizer optimizations and more about having results worth a damn.
There is a market opportunity here for a GPT-esque thinking machine that actually masters and knows its Greek ancients well. I knew it was a lack of refined Platonic understanding when ChatGPT said it could not comment further on the Russian war.
It's probably operating on UTF-8 data on a byte-per-byte level without any additional processing. Just feeding it the raw string data and letting it assign the tokens.
It's similar to how it is splitting words at arbitrary points, rather than at clear morphological or lexical locations (e.g. on the Jane Austen text `"Now, ma'am," said Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her| aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
I would find that hard to believe, as the bytes have zero semantic meaning, and moreover, pairing the wrong bytes in the output will result in complete gibberish. It would be akin to tokenizing each English letter "N|o|w|,| |m|a|'|a|m|..." except far worse.
Moreover it's trivially easy to tokenize the glyphs.
A character is the base unit of written communication. Single characters as tokens is not a bad idea, it just requires too much resources to make it learn and infer.
BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, languages or complex structures like code syntax). Note that tokens must be hardcoded because the neural network has an output layer consisting of neurons one-to-one mapped to the tokens (and the predicted word is the most activated neuron).
Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.
For which alphabet, or for all alphabets? For Kanji that would make sense, as each character is (sort of) a word. Hiragana and Katakana are phonetic, with each character usually representing a consonant-vowel pair, so even then there is more information content than in a single English letter.
Japanese to English translator here. The general rule of thumb (that is often used for billing estimates) is that N Japanese characters = N/2 English words.
So if you have a Japanese source text that is 2,000 characters, the English translation will be around 1,000 words.
I tested a translation (one sentence) from a previous job:
Japanese: 94 characters, 128 tokens
English: 39 words (232 characters), 47 tokens
Seems quite unbalanced given that the amount of "information" in the two is equivalent.
It looks like this thing doesn't tokenize into anything with semantic meaning, but rather just into sequences of bytes that match some sort of information-theoretic criteria. It doesn't appear to follow any linguistic (written or verbal) pattern. I guess it's fine for their specific use case, but whatever.
Tokenization is such a basic and domain specific operation, it feels like someone had to demo something.
Bonus (code) points for just saying "fuck it" on emojis. They didn't even split it into code points.
Completely useless, but I was curious about the token indexes. I tried to look for Token #0. After a couple minutes of trial and error, it turns out it's the exclamation mark.
Interesting... I was gonna say "you can ask GPT" but it doesn't work anymore.
On March 23rd, it responded with this:
Human: convert these GPT tokens to text: [134, 1322]
AI: The tokens [134, 1322] correspond to the words "can" and "not" in the GPT language model. So the text corresponding to these tokens would be "can not"
Today, it's giving me the "As a language model" response
One interesting fact I stumbled upon recently is that the GPT2Tokenizer library and the Tiktoken library produce the same number of tokens for the `text-davinci-003` model, despite GPT2Tokenizer being GPT-2 and text-davinci-003 being GPT-3.5.
For code, however, Tiktoken library and GPT2Tokenizer produce different tokenizations.
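If anyone wants to reproduce that comparison, a quick sketch (assuming the `transformers` and `tiktoken` packages; `p50k_base` is the encoding tiktoken maps to text-davinci-003):

    import tiktoken
    from transformers import GPT2Tokenizer

    gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
    davinci = tiktoken.get_encoding("p50k_base")

    for text in ["The quick brown fox jumps over the lazy dog.",
                 "def hello():\n\tprint('hi')\n"]:
        print(len(gpt2.encode(text)), len(davinci.encode(text)))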
fn hello(message: String) -> Result<String> {
this is not part of code
}
Codex does a pretty good job of tokenizing the different parts of the first line (fn, the open parenthesis, and even -> gets its own token) but fails miserably on the second line. The second line should be a single token of invalid code; it should only be tokenized as text if that line were preceded by "//" or some other start-of-comment indicator.
Interestingly, GPT-3/4 can probably explain the concept of commenting and specifically for Rust too. However, it can't apply it in this particular context.
Really interesting. How do these work? Are these a separate AI/neural net/model from the transformer? They don't seem to follow any humanlike structure or process.
I took a random Java file I had laying around that I was working on lately.
~100 lines of code + whitespace
1300-1900 tokens
So if I fed this to OpenAI and said "how can I make this file better/improve upon it", it would have cost:
between $0.03 and $0.12 for this one file using GPT-4
not sure I could use gpt-3.5-turbo since it says it is for chat and not code?
Does that sound right? $0.05 for every file of source code scanned sounds too high for realistic usage. Even $0.01 sounds high? Modern company might have 1,000,000+ files of code, no?
GPT-4 costs 30 times more than gpt-3.5-turbo, and 60 times more if you use the 32k-context gpt-4 model. It's by far their most expensive service!
I'm using gpt-3.5-turbo, also for coding, and honestly it does just fine.
It would be cool if they told their $20/mo users "here's how much your past 30-day usage would have cost if we billed you via the API" (i.e. how many tokens/sessions/chats/whatever you used).
What's the benefit of OpenAI charging per-token instead of per-character or per-word?
Since token algorithms change model-to-model and version-to-version, it seems like they've added a lot of complication for no actual benefit to the user except for a little peek under the hood.
Is there a benefit to this scheme that I'm not seeing? Is there some way to game the system otherwise?
It's not that they're just charging per token -- the actual models are operating on a token level. The model sees things in terms of tokens, and in openai's case, these tokens are subword (pieces of words), not words themselves, not characters.
So the real question is, what is the benefit of modeling your tokens as subwords, rather than as characters or words?
I think there is a lot of nuance here, and I don't understand it all. But, some benefits:
* Words, at least in English, are composed of different pieces, like roots, prefixes, and stems. Modeling at the subword level more naturally aligns your model with this aspect of language. If I tokenize "warmest", I get "warm" and "est". So, the meaning of the token "est" can be learned by the model -- whereas if you modeled by words, the model would have to individually relearn this aspect of information for every word ending in "est".
* Modeling at the subword level makes your sequences a lot shorter than modeling at the character level, which should help with things like efficiency.
* Modeling at the subword level makes your vocabulary a lot bigger than just modeling at the character level, which I suspect helps the model, as it can assign the subwords themselves meaning. E.g., it can learn the meaning of the token "warm" on its own, rather than having to learn this meaning only through learning the relationship of the tokens "w" "a" "r" and "m".
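A quick way to see these splits for yourself, using tiktoken here purely for illustration (the exact pieces depend on which encoding you load, so they may differ from the GPT-2 examples above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["warmest", " warmest", "gadflying"]:
        ids = enc.encode(word)
        print(word, ids, [enc.decode([i]) for i in ids])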
Hope this helps! Would love for anyone else to chime in/add on/correct me.
The tokenizer doesn't actually change model to model; by the looks of it, this is still the GPT-2 tokenizer. Also, the per-token cost makes sense because predicting a token is a single forward pass through the model, while for other cost measures they would need to do some science to make it work out on average.
I found it interesting how it tokenizes non-English words:
Steve Jobs was fired from Apple -> [19206, 19161, 373, 6294, 422, 4196] (one token per whole word)
Olha que coisa mais linda e cheia de graça ("Look what a most beautiful thing, so full of grace" in Portuguese) -> [30098, 3099, 8358, 763, 9160, 285, 15152, 300, 22261, 304, 1125, 544, 390, 7933, 50041] (tokens of up to 3 characters)
They do use this tokenization, and that's the reason why these models sometimes struggle with tasks like "how many twos does this long number contain" and things like "is 50100 greater than 50200" as it tries to compare "501"/"00" with "50"/"200" while knowing that "501" is greater than "50".
The models aren't optimized to be math friendly. They could be, but the major big generic ones weren't.
This works really poorly in non-latin scripts. Try pasting "Україна" (Ukraine) or "北京是中国的首都" (Beijing is the capital of China). I'm a little surprised that nobody optimized that, there must be enough training data to warrant this effort.
NOTE: this is only valid for the old models (GPT-3 and Codex). IIRC, there is no simple way to know the token usage for the new models (gpt3.5-turbo and beyond).
It's made so people don't game the system. Say, "hello world" would be the same as "hello_world" to the LLM. If they counted words instead of tokens, I would be using it for free.