These seem rather long. Do they count against my tokens for each conversation?
One thing I have been missing in both chatgpt and Claude is the ability to exclude some part of the conversation or branch into two parts, in order to reduce the input size. Given how quickly they run out of steam, I think this could be an easy hack to improve performance and accuracy in long conversations.
I've wondered about this - you'd naively think it would be easy to run the model through the system prompt, then snapshot its state as of that point, and then handle user prompts starting from the cached state. But when I've looked at implementations it seems that's not done. Can anyone eli5 why?
Keys and values for past tokens are cached in modern systems, but the essence of the Transformer architecture is that each new token can attend to every past token, so more tokens in a system prompt still consume resources on every generation step.
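To make that concrete, here's a toy sketch (plain NumPy; the names and shapes are mine, not taken from any real inference stack): the keys/values for the system-prompt tokens get computed once and kept in the cache, but every newly generated token still attends over all of those cached rows, so a long prompt keeps costing memory and compute at each step even though it is never re-encoded.

    import numpy as np

    d = 64  # head dimension (toy size)

    def attend(q, K, V):
        # q: (d,)   K, V: (n_past_tokens, d)
        scores = K @ q / np.sqrt(d)            # one score per past token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the whole history
        return weights @ V                      # weighted mix of past values

    rng = np.random.default_rng(0)

    # Keys/values for a 1000-token system prompt: computed once, then cached.
    K_cache = rng.standard_normal((1000, d))
    V_cache = rng.standard_normal((1000, d))

    # Generating each new token reuses the cache (the prompt is not re-run),
    # but attention still scans all 1000+ cached rows on every single step.
    for step in range(5):
        q = rng.standard_normal(d)              # query for the token being generated
        _ = attend(q, K_cache, V_cache)
        K_cache = np.vstack([K_cache, rng.standard_normal((1, d))])
        V_cache = np.vstack([V_cache, rng.standard_normal((1, d))])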
My guess is the following:
Every time you talk with the LLM it starts with a random 'state' (working weights) and then it reads the input tokens and predicts the follow-up. If you were to save the 'state' (intermediate weights) after inputting the prompt but before inputting the user's text, you would keep getting the same output from the network, which might have a bias or similar that you have now just 'baked in' to the model.
In addition, reading the input prompt should be a quick thing ... you are not asking the model to predict the next token until all the input has been read ... at which point you do not gain much by saving the state.
No, any randomness comes from the temperature setting, which mainly controls how much to sample from the probability mass of the next output vs. choosing the exact most likely next token (which tends to make them get into repetitive, loop-like convos).
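A rough sketch of what that knob does (toy logits, my own function names, not any particular API): at temperature 0 you always take the argmax, which is fully deterministic; at higher temperatures the softmax flattens and the next token is sampled from a wider slice of the probability mass.

    import numpy as np

    def sample_next(logits, temperature, rng):
        if temperature == 0:
            return int(np.argmax(logits))       # greedy: always the single most likely token
        scaled = logits / temperature           # higher temperature -> flatter distribution
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.5, 0.3, -1.0])    # toy scores for four candidate tokens

    print([sample_next(logits, 0.0, rng) for _ in range(5)])  # always token 0
    print([sample_next(logits, 1.0, rng) for _ in range(5)])  # mostly 0/1, occasionally others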