If you’re curious about how much KV Cache quantization affects Qwen3.5 27B, take a look at the table below.
The model used in all of these benchmarks is Unsloth’s Q8_K_XL.
KV Cache BF16 vs F16 vs Q8_0
| KV Cache Type | Mean PPL(Q) | ΔPPL (Q – base) | PPL Ratio | ln Ratio | Mean KLD | Max KLD | RMS Δp (%) | Same Top-p (%) |
|---|---|---|---|---|---|---|---|---|
| BF16 | 6.8653 ± 0.04470 | — | — | — | — | — | — | — |
| F16 | 6.866214 ± 0.044707 | 0.004435 ± 0.001083 | 1.000646 ± 0.000158 | 0.000646 ± 0.000157 | 0.000884 ± 0.000067 | 5.223969 | 0.873 ± 0.041 | 98.796 ± 0.028 |
| Q8_0 | 6.864972 ± 0.044693 | 0.003193 ± 0.001113 | 1.000465 ± 0.000162 | 0.000465 ± 0.000162 | 0.000974 ± 0.000068 | 4.321135 | 0.930 ± 0.048 | 98.720 ± 0.029 |
These benchmarks were run using llama-perplexity on the WikiText dataset with a context size of 512 tokens.
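For reference, here is a sketch of how such a comparison can be reproduced with llama.cpp. The model filename is a placeholder, and exact flag names can vary between builds; this assumes a build where `-ctk`/`-ctv` set the K/V cache types and `-fa` enables flash attention (required for a quantized V cache):

```sh
# 1) Baseline pass: BF16 KV cache; save per-token logits for later comparison.
./llama-perplexity -m model-Q8_K_XL.gguf -f wiki.test.raw -c 512 \
    -ctk bf16 -ctv bf16 -fa \
    --kl-divergence-base base_logits.kld

# 2) Comparison pass: Q8_0 KV cache; compute ΔPPL, KLD, etc. against the baseline.
./llama-perplexity -m model-Q8_K_XL.gguf -f wiki.test.raw -c 512 \
    -ctk q8_0 -ctv q8_0 -fa \
    --kl-divergence-base base_logits.kld --kl-divergence
```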
So what do the results show?
BF16 is the clear winner over F16 and Q8_0. And if you were going to run F16 anyway, there is no reason not to run BF16 instead, unless your hardware doesn't support it.
But what about F16 vs Q8_0? They are very similar: F16 edges ahead on mean KLD, while Q8_0 actually shows a slightly lower ΔPPL. Note that the stated uncertainties (the ± values) mean most of the F16 vs Q8_0 numbers are statistically indistinguishable from each other.
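As a rough sanity check on that last point, you can compare the mean-KLD gap to the combined uncertainty (treating the ± values as independent standard errors, which is an assumption on my part):

```sh
# Gap between Q8_0 and F16 mean KLD vs the combined standard error
# (values copied from the table above).
echo "gap:      $(echo '0.000974 - 0.000884' | bc -l)"
echo "combined: $(echo 'sqrt(0.000067^2 + 0.000068^2)' | bc -l)"
```

The gap (0.00009) is smaller than the combined uncertainty (~0.000095), i.e. under one standard error, so the KLD difference between F16 and Q8_0 is well within the noise.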
Recommended GPUs for local LLMs
For local LLM usage, Q8_0 KV cache is clearly the best choice: it roughly halves the cache's memory footprint compared to F16/BF16, which lets you fit much longer contexts in the same VRAM (see the size sketch at the end of this section). These are the GPUs I recommend:
| Model | RTX 5060 Ti | RTX 5070 Ti | RTX 5080 | RTX 5090 |
|---|---|---|---|---|
| Brand | ASUS | ASUS | GIGABYTE | ASUS |
| VRAM | 16 GB | 16 GB | 16 GB | 32 GB |
| VRAM Type | GDDR7 | GDDR7 | GDDR7 | GDDR7 |
| Memory Bandwidth | 448 GB/s | 896 GB/s | 960 GB/s | 1792 GB/s |
| Price (USD) | HERE | HERE | HERE | HERE |
I don’t list any GPUs with less than 16 GB of VRAM because they just aren’t worth the cost, and anything below 400 GB/s of memory bandwidth isn’t worth it either.
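To put the context-length advantage in numbers, here is a back-of-the-envelope KV cache size calculation. The model dimensions below are made-up, illustrative values, not the specs of the model benchmarked above; the key point is the bytes-per-element ratio (F16 stores 2 bytes per value, while ggml's Q8_0 stores 34 bytes per 32-value block, about 1.06 bytes per value):

```sh
# Hypothetical model dimensions (illustrative only, not the benchmarked model).
layers=48; kv_heads=8; head_dim=128; ctx=32768

# KV cache elements = 2 (K and V) * layers * kv_heads * head_dim * context.
elems=$(( 2 * layers * kv_heads * head_dim * ctx ))

# F16: 2 bytes/element. Q8_0: 34 bytes per 32-element block.
echo "F16  KV cache: $(( elems * 2 / 1024 / 1024 )) MiB"
echo "Q8_0 KV cache: $(( elems * 34 / 32 / 1024 / 1024 )) MiB"
```

With these numbers that works out to 6144 MiB for F16 vs 3264 MiB for Q8_0 at a 32K context, so for a fixed VRAM budget Q8_0 lets you run nearly twice the context.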