These seem rather long. Do they count against my tokens for each conversation?
One thing I have been missing in both chatgpt and Claude is the ability to exclude some part of the conversation or branch into two parts, in order to reduce the input size. Given how quickly they run out of steam, I think this could be an easy hack to improve performance and accuracy in long conversations.
I've wondered about this - you'd naively think it would be easy to run the model through the system prompt, then snapshot its state as of that point, and then handle user prompts starting from the cached state. But when I've looked at implementations it seems that's not done. Can anyone eli5 why?
Keys and values for past tokens are cached in modern systems, but the essence of the Transformer architecture is that each new token can attend to every past token, so more tokens in a system prompt still consume resources on every generation step.
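To make that concrete, here's a toy sketch (plain NumPy; the names and shapes are mine, not taken from any real inference stack): the keys/values for the system-prompt tokens get computed once and kept in the cache, but every newly generated token still attends over all of those cached rows, so a long prompt keeps costing memory and compute at each step even though it is never re-encoded.

    import numpy as np

    d = 64  # head dimension (toy size)

    def attend(q, K, V):
        # q: (d,)   K, V: (n_past_tokens, d)
        scores = K @ q / np.sqrt(d)            # one score per past token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the whole history
        return weights @ V                      # weighted mix of past values

    rng = np.random.default_rng(0)

    # Keys/values for a 1000-token system prompt: computed once, then cached.
    K_cache = rng.standard_normal((1000, d))
    V_cache = rng.standard_normal((1000, d))

    # Generating each new token reuses the cache (the prompt is not re-run),
    # but attention still scans all 1000+ cached rows on every single step.
    for step in range(5):
        q = rng.standard_normal(d)              # query for the token being generated
        _ = attend(q, K_cache, V_cache)
        K_cache = np.vstack([K_cache, rng.standard_normal((1, d))])
        V_cache = np.vstack([V_cache, rng.standard_normal((1, d))])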
My guess is the following:
Every time you talk with the LLM it starts with a random 'state' (working weights) and then it reads the input tokens and predicts the follow-up. If you were to save the 'state' (intermediate weights) after inputting the prompt but before inputting the user's text, you would keep getting the same output from the network, which might have a bias or similar that you have now just 'baked in' to the model.
In addition, reading the input prompt should be a quick thing ... you are not asking the model to predict the next token until all the input has been read ... at which point you do not gain much by saving the state.
No, any randomness comes from the temperature setting, which mainly controls how much to sample from the probability mass of the next output vs. choosing the exact most likely next token (which tends to make them get into repetitive, loop-like convos).
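A rough sketch of what that knob does (toy logits, my own function names, not any particular API): at temperature 0 you always take the argmax, which is fully deterministic; at higher temperatures the softmax flattens and the next token is sampled from a wider slice of the probability mass.

    import numpy as np

    def sample_next(logits, temperature, rng):
        if temperature == 0:
            return int(np.argmax(logits))       # greedy: always the single most likely token
        scaled = logits / temperature           # higher temperature -> flatter distribution
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.5, 0.3, -1.0])    # toy scores for four candidate tokens

    print([sample_next(logits, 0.0, rng) for _ in range(5)])  # always token 0
    print([sample_next(logits, 1.0, rng) for _ in range(5)])  # mostly 0/1, occasionally others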