For 30-40B parameter models, you'll see two types of performance impacts:
First, there's a direct throughput improvement – our benchmarks show a 14.5% speed increase with K8V4 versus FP16. This comes from better memory bandwidth utilization when processing the KV cache.
However, this won't make a 30B model suddenly feel as responsive as a 7B model. The fundamental computation bottleneck remains – larger models need more matrix multiplications regardless of how efficiently you store the KV cache.
Where you might notice a bigger difference is in handling longer inputs. With 59% less memory used for the KV cache, your system can dedicate more resources to computation rather than memory management, which can reduce stuttering when processing long documents.
The most noticeable improvement would be if you're currently hitting memory limits that force you to segment long inputs. Being able to process everything in one pass eliminates those artificial breaks.
@fennecbutt is spot-on that the core token generation speed is primarily determined by compute capability and model architecture. KVSplit complements those factors by optimizing memory usage, not by fundamentally changing the computation path.
Yup, this approach would likely work on NVIDIA/AMD GPUs as well - the underlying principle that keys require higher precision than values is hardware-independent.
The CUDA backend in llama.cpp already supports separate cache type settings with the `--cache-type-k` and `--cache-type-v` flags. Our particular patch is focused on Metal-specific optimizations, but the core technique transfers directly.
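If you want to try the asymmetric idea on a CUDA build today, the mainline flags look roughly like this. This is just a sketch, not our patch: exact cache type names and constraints vary by llama.cpp version, and some builds only allow a quantized V cache when Flash Attention is enabled.

```bash
# Mainline llama.cpp (CUDA build): asymmetric cache precision via existing flags
./llama-cli -m model.Q4_K_M.gguf -ngl 99 -c 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  -p "Summarize the following document: ..."
```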
Regarding compatibility with other quantization methods - absolutely. This KV cache optimization is complementary to model weight quantization (Q4_K_M, GPTQ, AWQ, etc.). You can combine asymmetric KV cache precision with any model weight format.
Since KV cache quantization happens at runtime while processing tokens (separate from model weights), it doesn't conflict with how the model itself is quantized. They operate on different parts of the inference pipeline.
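To make that separation concrete, here's a rough sketch of the two stages using mainline llama.cpp tool names (paths and type names are illustrative):

```bash
# 1) Weight quantization happens once, offline, when the .gguf is created
./llama-quantize model-f16.gguf model.Q4_K_M.gguf Q4_K_M

# 2) KV cache precision is a separate, per-run choice on top of whatever weights you load
./llama-cli -m model.Q4_K_M.gguf --cache-type-k q8_0 --cache-type-v q4_0 -c 16384 -p "..."
```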
What would require additional work is integrating with specialized inference engines that have custom KV cache handling, like vLLM or TensorRT-LLM. Each would need its own implementation of asymmetric KV precision.
The most immediate GPU benefit would likely come from integrating these insights into the FlashAttention implementation directly, where the memory bandwidth savings could translate to even greater speedups on CUDA hardware.
You're right to question the perplexity impact - 0.86% isn't negligible. Our extended testing shows this impact remains fairly consistent across context lengths up to 16K, which was our test limit.
We haven't benchmarked at 64-128K contexts yet, but theoretically the relative perplexity impact should remain stable. The absolute impact could potentially compound with very long contexts, though.
The key difference from standard KV quantization is the asymmetric approach. Most implementations use K8V8 (8-bit for both), which has a 0.03% perplexity impact but only 47% memory savings. K8V4 pushes this to 59% savings with the 0.86% quality tradeoff.
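If it helps, here's where those percentages come from. This is back-of-the-envelope math assuming ggml's q8_0 and q4_0 block layouts (32-element blocks, each carrying an fp16 scale), not a measurement from the repo:

```python
# Effective bits per element for ggml block formats (assumed layouts):
#   q8_0: 32 x int8 quants + 1 fp16 scale = 34 bytes per 32 elems -> 8.5 bits/elem
#   q4_0: 32 x 4-bit quants + 1 fp16 scale = 18 bytes per 32 elems -> 4.5 bits/elem
FP16, Q8_0, Q4_0 = 16.0, 34 * 8 / 32, 18 * 8 / 32

def kv_savings(k_bits: float, v_bits: float) -> float:
    """Fraction of KV cache memory saved vs. FP16 keys + FP16 values."""
    return 1 - (k_bits + v_bits) / (2 * FP16)

print(f"K8V8: {kv_savings(Q8_0, Q8_0):.0%} saved")  # ~47%
print(f"K8V4: {kv_savings(Q8_0, Q4_0):.0%} saved")  # ~59%
```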
For reference, the quality impact is still well below the typical 5% threshold where differences become noticeable in generated text. It's a reasonable tradeoff for the additional memory savings, especially at long contexts.
@smcleod - We're using the same underlying quantization methods, just applying them asymmetrically between keys and values. If your existing approach already uses lower precision for values than keys, you're likely getting similar benefits.
The memory savings from KVSplit scale proportionally with context length, so higher-RAM Macs (64GB/128GB) benefit even more in absolute terms. On a 128GB Mac Studio, you could potentially handle context windows in the hundreds of thousands of tokens.
However, KVSplit doesn't fundamentally change computation speed - just memory efficiency. Our benchmarks show a 14.5% throughput improvement with K8V4, but this comes from better memory locality, not reduced computation.
The "painfully slow" issue with large models on Apple Silicon stems primarily from the compute limitations, not memory constraints. A 70B parameter model will still run at similar token generation speeds regardless of available RAM or KV cache optimizations.
What KVSplit does is make better use of whatever memory you have available. It's particularly valuable when your bottleneck is context length rather than model size.
For practical Apple Silicon usage, the sweet spot remains smaller models (7B-13B) with now-expanded context windows. This lets you process significantly more text while maintaining reasonable generation speeds.
If your workflow needs both massive contexts AND large models, you'd still want to consider server-grade GPUs, but KVSplit helps push the boundary of what's feasible on Apple hardware.
Great question about the intuition! The difference comes from the core roles these components play in attention.
Keys determine which tokens to attend to - they create the actual attention pattern through similarity calculations. Values only store what information gets passed forward once attention is decided.
When a key vector is quantized too aggressively, it distorts the similarity calculations for every token interaction. A small error in keys can completely redirect attention to the wrong tokens.
Values, however, are much more forgiving. When a value vector is quantized, any error only affects the specific information content of that single token after the attention pattern is already established.
It's like a library catalog system vs. the books themselves. If catalog numbers (keys) are corrupted, you'll look in completely wrong sections. If some words in books (values) are smudged, you're still reading the right book - just with occasional noise.
Mathematically, keys participate in softmax calculations where small errors get exponentially amplified through the normalization process. Values just undergo linear weighted averaging, where errors tend to cancel out.
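Here's a deliberately constructed toy example (not from the repo) that shows the mechanism: two cached tokens whose keys score almost identically against the query, but whose values are very different. The same small rounding-style error is applied either to token A's key or to token A's value.

```python
import numpy as np

d = 64
q = np.ones(d)                          # query (1/sqrt(d) scaling omitted to keep numbers simple)
K = np.stack([np.full(d, 0.11),         # token A's key -> logit 7.04
              np.full(d, 0.10)])        # token B's key -> logit 6.40 (near tie)
V = np.stack([np.ones(d), -np.ones(d)]) # the two tokens carry very different values

def attend(K, V):
    w = np.exp(K @ q - (K @ q).max())
    w /= w.sum()
    return w @ V

eps = np.full(d, -0.02)                 # small per-element error, e.g. from coarse rounding

out      = attend(K, V)
out_kerr = attend(K + np.stack([eps, np.zeros(d)]), V)  # error on A's key
out_verr = attend(K, V + np.stack([eps, np.zeros(d)]))  # same error on A's value

print(np.linalg.norm(out_kerr - out))   # ~5.0: attention flips to token B
print(np.linalg.norm(out_verr - out))   # ~0.1: bounded by token A's attention weight
```

The exact numbers don't matter; the point is that a key error redistributes attention mass across tokens (here it flips which token wins), while a value error only leaks into the output in proportion to that one token's weight.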
I first encountered this asymmetry in papers like "More for Keys, Less for Values" and "KV-AdaQuant," but wanted to quantify exactly how it impacts Apple Silicon inference. The 7× quality difference between K8V4 and K4V8 using identical memory was striking.
Thanks for the installation feedback too! I'll fix the placeholder and make the Python dependencies more flexible.
My understanding is that the roles of KVQ aren't actually well understood, and that while they're called key/value/query tensors, it's not quite straightforward to tease out what they mean or the role they play.
With the K8V4 configuration providing 59% memory savings, the same KV cache budget covers roughly 2.4× more tokens: a context that previously topped out at 2048 tokens can now reach about 5,000, and an 8K context stretches to approximately 19.5K.
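The multiplier is just the reciprocal of what's left after the savings; a quick sanity check:

```python
savings = 0.59                      # measured K8V4 KV-cache reduction vs. FP16
multiplier = 1 / (1 - savings)      # ~2.44x more tokens fit in the same memory budget
for base_ctx in (2048, 8192):
    print(base_ctx, "->", round(base_ctx * multiplier))   # roughly 5,000 and 20,000
```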
In practical terms, this means processing entire books at once on a MacBook, analyzing large codebases without splitting files, or maintaining comprehensive conversation history in chat applications.
The memory savings scale linearly with context length - the longer your context window, the more absolute memory you save. On my M4 MacBook with 8K context, I reduced KV cache from 176MB to 72MB. At 128K context, that same percentage saving would free up gigabytes.
This optimization is most valuable when you're context-window limited rather than model-parameter limited. If you're hitting OOM errors due to long inputs rather than large model weights, KVSplit directly addresses your bottleneck.
Yes, that's one of the key benefits - KVSplit works with any existing .gguf model without requiring reconstruction or special conversion. The quantization happens at runtime on the KV cache, not during model loading or conversion.
This works because the KV cache is created during inference as tokens are processed, completely separate from the model weights themselves. The `--kvq-key` and `--kvq-val` flags simply tell llama.cpp how to store these intermediate tensors in memory.
The only limitation is that it requires llama.cpp's Metal backend, and you need to disable Flash Attention with `-fa 0` since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.
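For reference, a K8V4 run on the patched Metal build looks something like the sketch below; the exact argument syntax for the flags is my assumption, so check the repo README for the real form.

```bash
# Patched Metal build, asymmetric KV cache precision (K8V4).
# NOTE: the "8"/"4" argument style is assumed here; see the KVSplit README for exact syntax.
# -fa 0 is required because the current Flash Attention path bypasses the custom KV format.
./llama-cli -m model.Q4_K_M.gguf \
  --kvq-key 8 --kvq-val 4 \
  -fa 0 -c 32768 \
  -p "..."
```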