How do any of these sliding window techniques handle instructions that are unexpected and only show up at the end? For example, imagine feeding a book to the model and the last sentence being the instruction “return the count of the letter m in the previous input”. A human would handle this by first letting out an exasperated sigh, but then restarting the reading while counting. An LLM has no ability to loop back and re-read the input. (Ignore LLM issues with character counting for this example.) It seems like, to really solve this problem, the LLM needs to be able to loop and jump arbitrarily, but I’m sure that would introduce a whole new host of issues and possibly require a new architecture altogether.
On a similar note, I can't wait for LLMs to digest _all_ the research papers that are accessible and readable enough for them, "take notes" in an index-suitable format/structure, and then act like a human who'd done that over an obviously more limited corpus: respond to questions by translating them into relevant keywords, looking them up, _skimming the contents again,_ and finding relevant information. The first lookup might not turn up anything useful, and thus necessitate further visits to the index/library.
With the needed preprocessing, an LLM that can "go and do some research to adequately respond" could be extremely powerful.
We've spent the last ~10 millennia improving knowledge management technology to scale beyond the capacity/time of individual brains. Let the language model use actual research on this and pre-digest, not just Bing search.
No need for its short-term memory to remember that, say, some piece of code did something; just tag it when reading and rely on scalable shared indexing of tags.
Though the more I think about it, the more it sounds like normal LLM pretraining with the knowledge index being the giant chunk of LLM weights.
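To make the "take notes, then look them up" idea concrete, here's a minimal sketch of a keyword index over pre-digested chunks. The chunking, keyword extraction, and `ask_llm` callable are all placeholders I'm making up for illustration, not any existing system:

```python
from collections import defaultdict

# Hypothetical pre-digestion step: split each paper into chunks and
# index the chunks by the words they contain.
def build_index(papers):
    index = defaultdict(list)          # keyword -> list of chunks
    for paper in papers:
        for chunk in paper.split("\n\n"):
            for word in set(chunk.lower().split()):
                index[word].append(chunk)
    return index

# At question time: turn the question into keywords, look them up,
# and hand only the retrieved chunks back to the model for a second read.
def answer(question, index, ask_llm):
    keywords = [w.lower() for w in question.split() if len(w) > 4]
    retrieved = {c for k in keywords for c in index.get(k, [])}
    context = "\n---\n".join(retrieved)
    return ask_llm(f"Using these notes:\n{context}\n\nAnswer: {question}")
```

If the first retrieval misses, the model (or the harness) can reformulate the keywords and query the index again, which is the "further visits to the library" part.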
One option would be something similar to function calling: give the LLM an output it can emit that changes how the context is parsed. That's a layer on top rather than changing how the LLM itself works.
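Something like the sketch below, where the model can emit a pseudo tool call asking the harness to re-feed a slice of the original document. The tag format and the `call_model` callable are made up for illustration:

```python
import re

REREAD = re.compile(r"<reread start=(\d+) end=(\d+)/>")

def run_with_reread(document, instruction, call_model, max_hops=5):
    # Start with a window that may not contain what the model needs.
    context = document[:4000]
    for _ in range(max_hops):
        reply = call_model(
            f"{context}\n\nInstruction: {instruction}\n"
            "If you need another part of the document, emit "
            "<reread start=CHAR end=CHAR/> instead of answering.")
        m = REREAD.search(reply)
        if not m:
            return reply                   # model answered directly
        start, end = int(m.group(1)), int(m.group(2))
        context = document[start:end]      # "loop back" on the model's behalf
    return reply
```

The model itself still can't loop, but the harness loops for it, which sidesteps needing a new architecture.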
Does an LLM need to loop back to re-read its input, even in a regular (read: non-sliding) context window?
Maybe I'm misunderstanding, but doesn't the hidden state solve the "lookup" problem in this case? In the sense that the LLM needs to ingest your entire input anyway before answering, so whether your instruction is at the front or at the end has little impact beyond attention.
It's my understanding that in regular, non-sliding-window context models the LLM is able to pay attention to any part of the input when generating the output. The attention heads are essentially able to jump back and forth to any point in the context window. This is what differentiates the attention mechanism from other models that use token proximity as a proxy for relevance.
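The difference shows up directly in the attention mask. A toy numpy version (not any particular model's implementation): with `window=None` every token can attend to everything before it, while a sliding window cuts off anything older than the last `window` tokens, which is exactly why a late instruction can't "see" a detail from the middle of the book.

```python
import numpy as np

def attention(Q, K, V, window=None):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: token i can only see tokens j <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))
    if window is not None:
        # Sliding window: token i can additionally only see the last `window`
        # tokens, i.e. j >= i - window + 1.
        mask &= np.triu(np.ones((n, n), dtype=bool), -window + 1)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```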
The fact that people are still treating it like entirely raw text input is insane to me. If you have a document, have a separate input where the user pastes/uploads the data, and then another for the user's instruction.
That allows you to do things like chunk the document while leaving their instruction alone, or slide a window over just the document while the instruction stays static.
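Roughly something like this, where the instruction is repeated verbatim with every chunk instead of scrolling out of the window. The prompt template and `call_model` are placeholders:

```python
def map_over_document(document, instruction, call_model,
                      chunk_size=3000, overlap=200):
    # Slide over the document; the user's instruction stays static in every call.
    partials = []
    step = chunk_size - overlap
    for start in range(0, len(document), step):
        chunk = document[start:start + chunk_size]
        partials.append(call_model(
            f"Document excerpt:\n{chunk}\n\nInstruction: {instruction}\n"
            "Answer only from this excerpt; say NOTHING FOUND if it doesn't apply."))
    # Reduce step: combine the per-chunk answers into one.
    return call_model("Combine these partial answers into a single answer:\n"
                      + "\n".join(partials)
                      + f"\n\nOriginal instruction: {instruction}")
```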
Ignore the specific example of counting characters; I was just quickly coming up with a situation where the instruction is at the end of the input. Here is a better example:
Input the full text of a novel, then ask for a minor detail (e.g. the color of a car that is briefly mentioned in the middle of the book). Again, a human can do this by flipping back to the relevant section, but LLMs have no mechanism for this when using a sliding window attention scheme.
If the full input can fit in the context window then any LLM today would be able to extract the color of the car.
I agree, even just tokenization screws you here, I'm 95% sure. I.e. the raw input isn't letters but a sequence drawn from ~100K integers, each representing some set of letters.
That being said, it's probably a naive take, since we're seeing them do so much. And I bet we could get it to count correctly with at least some short input, and given infinite runs it's probably trivial (i.e. for N characters, split into N inputs, and for each one ask "say true if it is an M, false otherwise").
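To illustrate both points: the first snippet uses tiktoken just to show that the model receives token ids rather than letters, and the second is the brute-force per-character map, with a stand-in `ask_llm` callable for the true/false check:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "mammoth mummies make marmalade"
# A short list of integers; the letter 'm' is nowhere directly visible to the model.
print(enc.encode(text))

# Brute-force counting by mapping over characters, one model call each.
def count_letter(text, letter, ask_llm):
    total = 0
    for ch in text:
        # Each call sees a single character, so tokenization can't hide it.
        reply = ask_llm(f"Answer true or false only: is '{ch}' the letter '{letter}'?")
        if reply.strip().lower() == "true":
            total += 1
    return total
```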
I understand that, which is why I said "Ignore LLM issues with character counting for this example". It was a quick example, please see my other comment with a better example.
I see, active listening + relating it to my knowledge on my end, lmk if I compressed too much:
you're curious whether there's noticeably worse performance if the Q is at the end of the content rather than before it
No, there's a good paper on this somewhere with the Claude 100K model; tl;dr it's sort of bow-shaped: beginning and end had equally high retrieval rates, but the middle would suffer
No, what I am specifically asking about is these sliding window attention techniques. As far as I understand it, Claude 100K actually uses a full 100K context window, not a sliding window.