1. How many tokens of context can 'traditional' models (e.g. Mistral's Mixtral 8x7B) fit on a single 80 GB GPU?
2. How does quantization affect each transformer layer in the stack? What are the performance/accuracy trade-offs when so little of the stack depends on this bottleneck?
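Not an answer, but a rough way to estimate question 1 yourself. The sketch below is a back-of-envelope calculation assuming Mixtral 8x7B's published architecture (32 layers, grouped-query attention with 8 KV heads of dimension 128, ~46.7B total parameters) and ignoring activation memory and framework overhead:

    # Back-of-envelope: how many tokens of fp16 KV cache fit on an 80 GB GPU
    # next to Mixtral 8x7B's weights, at different weight quantizations.
    # Config values are the published Mixtral 8x7B architecture; the rest
    # (no activation/fragmentation overhead, dense KV cache) is assumed.

    GIB = 1024**3

    n_layers = 32
    n_kv_heads = 8         # grouped-query attention
    head_dim = 128
    total_params = 46.7e9  # all experts count toward memory, even though
                           # only ~12.9B parameters are active per token

    def weight_bytes(bits_per_param):
        return total_params * bits_per_param / 8

    def kv_bytes_per_token(kv_bits=16):
        # K and V tensors, per layer, per KV head
        return 2 * n_layers * n_kv_heads * head_dim * kv_bits / 8

    for wbits in (16, 8, 4):
        free = 80 * GIB - weight_bytes(wbits)
        if free <= 0:
            print(f"{wbits}-bit weights alone don't fit in 80 GB")
            continue
        tokens = free / kv_bytes_per_token()
        print(f"{wbits}-bit weights: ~{free / GIB:.0f} GiB free, "
              f"~{tokens / 1000:.0f}K tokens of fp16 KV cache")

The per-token KV cost (~128 KiB here) is independent of weight quantization, which is why quantizing the weights buys so much extra context on the same card.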
Mixtral 8x7B runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any GPU, provided you have enough RAM and CPU cores. 32 GB of RAM and 16 hardware threads are enough with 4-bit quantization, as long as you don't ask for too much context.
P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
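For anyone who wants to try this setup, here's a minimal sketch using llama-cpp-python; the GGUF filename is just an example of a common 4-bit quant, and the thread/context numbers mirror the comment above:

    # CPU-only Mixtral 8x7B with 4-bit quantization via llama-cpp-python.
    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # example filename, ~26 GB on disk
        n_ctx=4096,      # modest context keeps RAM use within ~32 GB
        n_threads=16,    # one per hardware thread
        n_gpu_layers=0,  # no GPU offload
    )

    out = llm("Q: Why is the sky blue? A:", max_tokens=128)
    print(out["choices"][0]["text"])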
They are claiming that they resolved the vulnerability that caused the token leak, but don't say what it was. That doesn't exactly seem transparent to me, or like handling it well.
I was contracting for them last year and tried, among other things, to build an actual engineering culture that prevents and fixes the issues that accumulate into catastrophic incidents like this.
They generally prefer to "ship fast".
I informed them very thoroughly again on January 13th (3+ months after they terminated me for "cultural differences"), because I was worried about this exact nightmare scenario happening very soon.
The reason was that they open-sourced a package that lets an attacker easily practice and test the exploit locally in about a minute.
MDX easily exposes you to cross-site scripting (XSS): it compiles embedded JSX, so user-supplied content containing JSX event handlers can run script in the page.
I assume this is the "fixed vulnerability" they are talking about; I mention it here just to be transparent.