Qwen3.5 27B Q8 – KV Cache Benchmarks: BF16 vs F16 vs Q8_0

If you’re curious about how much KV Cache quantization affects Qwen3.5 27B, take a look at the table below.

The model used in all of these benchmarks is Unsloth’s Q8_K_XL.

KV Cache: BF16 vs F16 vs Q8_0

| KV Cache Type | Mean PPL(Q) | ΔPPL (Q – base) | PPL Ratio | ln Ratio | Mean KLD | Max KLD | RMS Δp (%) | Same Top-p (%) |
|---|---|---|---|---|---|---|---|---|
| BF16 | 6.8653 ± 0.0447 | 0 (baseline) | – | – | – | – | – | – |
| F16 | 6.866214 ± 0.044707 | 0.004435 ± 0.001083 | 1.000646 ± 0.000158 | 0.000646 ± 0.000157 | 0.000884 ± 0.000067 | 5.223969 | 0.873 ± 0.041 | 98.796 ± 0.028 |
| Q8_0 | 6.864972 ± 0.044693 | 0.003193 ± 0.001113 | 1.000465 ± 0.000162 | 0.000465 ± 0.000162 | 0.000974 ± 0.000068 | 4.321135 | 0.930 ± 0.048 | 98.720 ± 0.029 |
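As a quick consistency check on the table, the ln Ratio column is just the natural log of the PPL Ratio column, and for ratios this close to 1 the two carry the same digits (since ln(1 + x) ≈ x for small x). A couple of lines of Python, with the values copied from the rows above:

```python
import math

# PPL Ratio and ln Ratio values copied from the table above.
ppl_ratio = {"F16": 1.000646, "Q8_0": 1.000465}
ln_ratio  = {"F16": 0.000646, "Q8_0": 0.000465}

# ln(ratio) ~= ratio - 1 when the ratio is this close to 1, which is
# why the two columns look identical at this precision.
for kv, r in ppl_ratio.items():
    assert abs(math.log(r) - ln_ratio[kv]) < 1e-6
    print(kv, round(math.log(r), 6))
```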

These benchmarks were run using llama-perplexity against the WikiText dataset with a 512-token context.
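For reference, here is a sketch of the kind of llama-perplexity invocation that produces these KLD stats. The file paths are placeholders, not the author's exact files; the flags (`-c`, `-ctk`/`-ctv`, `--kl-divergence`, `--kl-divergence-base`) are llama.cpp's, where the base logits are saved from a baseline run first:

```shell
# Placeholder paths; substitute your own model and dataset files.
# -c 512                : 512-token chunks, as stated above
# -ctk / -ctv           : KV cache cell type for K and V (f16, bf16, q8_0, ...)
# --kl-divergence-base  : logits saved from the BF16 baseline run
./llama-perplexity \
  -m Qwen3.5-27B-Q8_K_XL.gguf \
  -f wiki.test.raw \
  -c 512 \
  --kl-divergence-base base-logits.bin \
  --kl-divergence \
  -ctk q8_0 -ctv q8_0
```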

So what are the results of the benchmark?

BF16 comes out ahead of F16 and Q8_0. If you were going to run F16 anyway, there is no reason not to run BF16 instead, unless your hardware doesn't support it.

But what about F16 vs Q8_0? They are very close, with F16 just a hair better. Note that the margins of error in these stats (the ± values) mean that most of the F16 vs Q8_0 numbers are statistically indistinguishable from each other.

Recommended GPU for local LLM

For local LLM usage, Q8_0 KV cache is clearly the best choice, since it lets you fit much longer context lengths into the same VRAM. These are the GPUs I recommend.

| Model | RTX 5060 Ti | RTX 5070 Ti | RTX 5080 | RTX 5090 |
|---|---|---|---|---|
| Brand | ASUS | ASUS | GIGABYTE | ASUS |
| VRAM | 16 GB | 16 GB | 16 GB | 32 GB |
| VRAM Type | GDDR7 | GDDR7 | GDDR7 | GDDR7 |
| Memory Bandwidth | 448 GB/s | 896 GB/s | 960 GB/s | 1790 GB/s |
| Price (USD) | HERE | HERE | HERE | HERE |

I don’t show any GPUs lower than 16GB VRAM because they are just not worth the cost. Moreover, anything lower than 400 GB/s of memory bandwidth isn’t worth it either.
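To put the context-length advantage of Q8_0 in rough numbers, here is a back-of-the-envelope sketch of KV cache size. The model dimensions below are illustrative assumptions, not the published Qwen3.5 27B config (substitute the real values from the GGUF metadata); the per-element costs come from llama.cpp's storage formats, where a Q8_0 block packs 32 values plus one f16 scale into 34 bytes, versus 2 bytes per value for F16/BF16:

```python
# Illustrative model dimensions -- NOT the actual Qwen3.5 27B config.
N_LAYERS = 48     # assumed transformer layer count
N_KV_HEADS = 8    # assumed KV heads (GQA)
HEAD_DIM = 128    # assumed per-head dimension

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem

F16 = 2.0         # 2 bytes per element
Q8_0 = 34 / 32    # 34 bytes per 32-element block (values + f16 scale)

for ctx in (8192, 32768):
    f16_gib = kv_bytes_per_token(F16) * ctx / 1024**3
    q8_gib = kv_bytes_per_token(Q8_0) * ctx / 1024**3
    print(f"{ctx:>6} tokens: F16 {f16_gib:.2f} GiB, Q8_0 {q8_gib:.2f} GiB")
```

Under these assumptions, Q8_0 cells cost about 53% of the F16 footprint, so for a fixed VRAM budget the cache holds nearly twice the context.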