
> More specifically they spit out a list of "logits" aka the probability of every token (llama has a vocab size of 32000, so you'll get 32000 probabilities).

I would have thought you have 7B probabilities (7B possible tokens) and 32k is just the context. So for every token (up to 32k, because that's when it runs out of context/size), you have 7B probabilities.

I feel like you mixed up context size with # of parameters?



> I feel like you mixed up context size with # of parameters?

Respectfully I do not think I did :)

The 7B is the parameter count. 32k is the vocab size, or list of tokens.

I don't have access to the llama repo at the moment so I'll use this one - https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/blob/...

The vocab list is here https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/raw/m...

vocab_size is 32k. max_position_embeddings is 4096 for the context. Note that you can use a longer or shorter context length, but unless you use some tricks, longer context will result in rapidly decreasing performance.

You will get 32k logits as the end result.
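If you want to check those numbers yourself, here's a quick sketch with the transformers library (assuming it's installed; the repo name is the one linked above):

    from transformers import AutoConfig

    # Reads config.json from the repo linked above
    cfg = AutoConfig.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
    print(cfg.vocab_size)                # 32000 -> length of the logit list
    print(cfg.max_position_embeddings)   # 4096  -> default context length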


Can you help me understand how a parameter and a token differ?

> 32k is the vocab size, or list of tokens.

This sounds to me like it is choosing from 1 of 32k tokens when it is scoring/generating an answer.

Where do the 7,000,000,000 parameters come from then? I would have thought it is picking from 1 of 7B parameters?

Parameter != token?


> Parameter != token?

They are two different things. A parameter is not what programmers call "parameters/arguments"; the parameter count is the total number of weights (and biases) across all of the layers of the network.

A token is a number that represents some character(s), and there is a map of them in the vocab list. It's similar to ASCII or UTF-8, except a token can be either a single character or several. For example "I" might be a token and "ca" might be a token. Short words especially might be a single token, but most words are a combination of tokens.
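You can see the splits yourself with the tokenizer from the repo linked above (a sketch; the exact pieces each word splits into depend on the tokenizer):

    from transformers import AutoTokenizer

    # Same repo as linked above; any Llama-2 tokenizer behaves the same way
    tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
    for word in ["I", "ca", "cadence", "tokenization"]:
        pieces = tok.tokenize(word)                        # the sub-word pieces
        ids = tok.encode(word, add_special_tokens=False)   # their ids in the vocab
        print(word, pieces, ids)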

> This sounds to me like it is choosing from 1 of 32k tokens when it is scoring/generating an answer. [...] I would have thought it is picking from 1 of 7B parameters?

The model doesn't do that itself. The last layer of the model maps it to a single list that is 32k long: one score for every token in the vocab (these raw scores are the logits; run them through a softmax and you get each token's probability). So the output is a list of logits.

The step after this is external to the model - sampling. You can choose the highest scoring token if you want (greedy sampling), but usually you want to add a little randomness by sampling from, say, the 20 highest-probability tokens (top_k) or from the smallest set of tokens whose combined probability reaches some threshold like 90% (top_p). There are other fun tricks like logit biasing.

Finally you map the tokens to their respective values from the vocab list in my previous comment.
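Roughly, that sampling step looks like this outside the model (a PyTorch sketch, not any particular library's implementation; `logits` is the 32k-long list from the last layer):

    import torch

    def sample_next_token(logits, temperature=1.0, top_k=20, top_p=0.9):
        # logits: tensor of shape (vocab_size,), e.g. 32000 raw scores
        logits = logits / temperature
        probs = torch.softmax(logits, dim=-1)

        # top_k: only keep the k highest-probability tokens
        topk_probs, topk_ids = torch.topk(probs, top_k)

        # top_p: within those, keep the smallest set whose cumulative prob fits the threshold
        cumulative = torch.cumsum(topk_probs, dim=-1)
        keep = cumulative <= top_p
        keep[0] = True                      # always keep at least the best token
        topk_probs, topk_ids = topk_probs[keep], topk_ids[keep]

        # renormalize and sample one token id
        topk_probs = topk_probs / topk_probs.sum()
        choice = torch.multinomial(topk_probs, num_samples=1)
        return topk_ids[choice].item()

    # greedy sampling would just be: torch.argmax(logits).item()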

> Where do the 7,000,000,000 parameters come from then?

Here's a walk-through of how to count the parameters for Llama-13B. It's worth noting that most of these model sizes are rounded (that's why you might see llama-33B called llama-30B).

https://medium.com/@saratbhargava/mastering-llama-math-part-...

https://github.com/saratbhargava/ai-blog-resources/blob/main...
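If you just want the number rather than the math, you can also count them directly (a sketch; loading the full model needs a fair amount of RAM):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Llama-2-7b-chat-hf", torch_dtype=torch.float16
    )
    n_params = sum(p.numel() for p in model.parameters())
    print(n_params)  # roughly 6.7B, which gets rounded and marketed as "7B"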


> there is a map of them in the vocab list

Is the "parameters" the vocab list?

> The model doesn't do that itself. The last layer of the model maps it to a single list that is 32k long

So from the 7B parameters, it picks which 32k potential parameters to use as tokens? So the "vocab list" is a 32k subset of the 7B parameters that changes with each request?


> Is the "parameters" the vocab list?

The vocab list is the mapping of token ids to their text (UTF-8) values. For example, token 680 is "ide" and token 1049 is "land".
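You can look those up yourself (a sketch using the same repo's tokenizer):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
    print(tok.convert_ids_to_tokens([680, 1049]))  # the strings those ids map to
    print(len(tok))                                # the vocab size, 32000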

> So from the 7B parameters, it picks which 32k potential parameters to use as tokens? So the "vocab list" is a 32k subset of the 7B parameters that changes with each request?

No, do not think of parameters and tokens as comparable in any way (at least right now). The parameter count is just the total number of weights and biases throughout all of the layers. The vocab size is static, and will never change during inference.

This is simplified - you give it a list of tokens, and then there's a bunch of linear algebra with matrices of "weights". All of those weights combined are the 7B parameters.

The final operation results in a 32k long list of probabilities. The 680'th item in that list is the probability of "ide", the 1049th item is the probability of "land".

The model never "picks" anything; there are no "if" statements, so to speak. You give it an input and it does a bunch of multiplication, resulting in a list of predictions 32k long.

The model does not "pick the best token" at any point. It simply hands you 32k predictions. It's your job to pick the token; you can certainly just pick the highest probability, but that's not usually the sampling method people use https://towardsdatascience.com/how-to-sample-from-language-m...

I highly recommend 3blue1brown to learn more about neural networks (and anything math related he's great) https://www.3blue1brown.com/lessons/neural-networks

tl;dr - very oversimplified - you multiply/add 7B numbers with your input list resulting in 32k predictions, which you map to the letters in the vocab list.
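Putting that whole loop in code form (a sketch - in practice you'd just call model.generate(), but this shows the raw 32k-long output; needs roughly 14GB of RAM/VRAM in fp16):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "NousResearch/Llama-2-7b-chat-hf"
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits        # shape (1, seq_len, 32000)
    next_scores = logits[0, -1]           # 32000 scores for the *next* token
    probs = torch.softmax(next_scores, dim=-1)

    # map the top few back to strings via the vocab list
    top = torch.topk(probs, 5)
    print(tok.convert_ids_to_tokens(top.indices.tolist()))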


> The vocab size is static, and will never change during inference.

I feel like the open source LLMs that are out there are never advertised/compared by their vocab list size.

7B weights to pick which of 32k tokens to pick, over and over (per token, sequentially)


> I feel like the open source LLMs that are out there are never advertised/compared by their vocab list size.

Increasing the vocab size increases training costs with little improvement in evaluation performance (how "smart" it is), and not much of an inference speed improvement either. Words not in the vocab can be built from several tokens: GPT-3.5-turbo/GPT-4 uses two for "compared", "comp" and "ared".
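You can check that with OpenAI's tiktoken library (a sketch, assuming tiktoken is installed):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    ids = enc.encode("compared")
    print(len(ids))                                          # number of tokens used
    print([enc.decode_single_token_bytes(t) for t in ids])   # the pieces, e.g. b'comp', b'ared'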

That's not to say the existing vocab lists can't be further optimized, but there's been a lot more focus on the parameter count, structure, training/finetuning optimizations like LoRA, quantization methods, and training data, as these are what actually embed the information of how to predict the correct token.

There are a few cases where the vocab list is very important and you will see it mentioned. The more human languages you want to support, the more tokens you'll generally want. GPT-3's old tokenizer didn't have a 4-space indent ("    ") as a token, which wasn't great for programming, so their "codex" model had a different tokenizer that did.

Other cases for specialized tokens include special characters like "fill-in-the-middle", control tokens for "<System>","</System>","<Prompt>","</Prompt>", and a few things of that nature.

A lot of llama models have added a few extra control tokens for different purposes. Note that because tokens are just mapped from a number, you can generally add a new token and finetune the model a bit to embed its meaning, but you generally don't want to change existing tokens whose usage has already been "baked" into the model.
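With transformers, adding a token is basically two calls plus the finetune (a sketch; "<System>" here is just an illustrative control token, not one Llama ships with):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "NousResearch/Llama-2-7b-chat-hf"
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    # register the new token string so it gets its own id
    tok.add_tokens(["<System>"], special_tokens=True)
    model.resize_token_embeddings(len(tok))   # grows the embedding + output layers by one row
    # ...then finetune so the model actually learns what the new token means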



