Just tested it - this definitely doesn't seem to be giving enhanced context length. It does run quickly though; I can confirm it was using about 35 GB of RAM on an A100, pinned at that level for the entire duration.
My test: I grabbed a book from Project Gutenberg, split it into paragraphs, and fed them in one at a time (asking it to say "okay" after each paragraph), then asked some questions at the end. It entirely hallucinated its answers. (Also note: in the ~10 minutes I spent playing with this, I couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond in English...)
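For reference, here's a minimal sketch of that test loop. This is a hypothetical reconstruction, not the code from the gist linked below; the Gutenberg URL and the chat() helper are placeholder assumptions you'd swap out for a real inference call.

    # Hypothetical sketch of the paragraph-by-paragraph recall test.
    # Not the code from the linked gist; BOOK_URL and chat() are
    # placeholder assumptions.
    import re
    import urllib.request

    BOOK_URL = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"  # any plain-text book works

    def chat(history):
        # Placeholder for the model under test: replace with a real
        # inference call (e.g. a transformers pipeline or API client).
        # The canned reply keeps this sketch runnable without a GPU.
        return "okay"

    text = urllib.request.urlopen(BOOK_URL).read().decode("utf-8")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    history = []
    for para in paragraphs:
        history.append(("user", para + '\n\nJust say "okay".'))
        history.append(("assistant", chat(history)))

    # With the whole book in context, probe whether anything was retained.
    history.append(("user", "Who does the protagonist marry at the end?"))
    print(chat(history))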
https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...