If you’re curious about how much KV Cache quantization affects Qwen3.5 27B, take a look at the table below.
The model used in all of these benchmarks is Unsloth’s Q8_K_XL.
KV Cache BF16 vs F16 vs Q8_0
| KV Cache Type | Mean PPL(Q) | ΔPPL (Q – base) | PPL Ratio | ln Ratio | Mean KLD | Max KLD | RMS Δp (%) | Same Top-p (%) |
|---|---|---|---|---|---|---|---|---|
| BF16 | 6.8653 ± 0.04470 | — | — | — | — | — | — | — |
| F16 | 6.866214 ± 0.044707 | 0.004435 ± 0.001083 | 1.000646 ± 0.000158 | 0.000646 ± 0.000157 | 0.000884 ± 0.000067 | 5.223969 | 0.873 ± 0.041 | 98.796 ± 0.028 |
| Q8_0 | 6.864972 ± 0.044693 | 0.003193 ± 0.001113 | 1.000465 ± 0.000162 | 0.000465 ± 0.000162 | 0.000974 ± 0.000068 | 4.321135 | 0.930 ± 0.048 | 98.720 ± 0.029 |
These benchmarks were run using llama-perplexity on the WikiText dataset with a context size of 512 tokens.
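For reference, here is a sketch of how such a comparison can be reproduced with llama.cpp. The model filename is a placeholder, and exact flag names can vary between builds; this assumes a build where `-ctk`/`-ctv` set the K/V cache types and `-fa` enables flash attention (required for a quantized V cache):

```sh
# 1) Baseline pass: BF16 KV cache; save per-token logits for later comparison.
./llama-perplexity -m model-Q8_K_XL.gguf -f wiki.test.raw -c 512 \
    -ctk bf16 -ctv bf16 -fa \
    --kl-divergence-base base_logits.kld

# 2) Comparison pass: Q8_0 KV cache; compute ΔPPL, KLD, etc. against the baseline.
./llama-perplexity -m model-Q8_K_XL.gguf -f wiki.test.raw -c 512 \
    -ctk q8_0 -ctv q8_0 -fa \
    --kl-divergence-base base_logits.kld --kl-divergence
```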
So what do the results show?
BF16 is the clear winner over F16 and Q8_0. And if you were going to run F16 anyway, there is no reason not to run BF16 instead, unless your hardware doesn't support it.
But what about F16 vs Q8_0? They are very similar: F16 edges ahead on mean KLD, while Q8_0 actually shows a slightly lower ΔPPL. Note that the stated uncertainties (the ± values) mean most of the F16 vs Q8_0 numbers are statistically indistinguishable from each other.
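As a rough sanity check on that last point, you can compare the mean-KLD gap to the combined uncertainty (treating the ± values as independent standard errors, which is an assumption on my part):

```sh
# Gap between Q8_0 and F16 mean KLD vs the combined standard error
# (values copied from the table above).
echo "gap:      $(echo '0.000974 - 0.000884' | bc -l)"
echo "combined: $(echo 'sqrt(0.000067^2 + 0.000068^2)' | bc -l)"
```

The gap (0.00009) is smaller than the combined uncertainty (~0.000095), i.e. under one standard error, so the KLD difference between F16 and Q8_0 is well within the noise.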
Recommended GPUs for local LLMs
For local LLM usage, Q8_0 KV cache is clearly the best choice: it roughly halves the cache's memory footprint compared to F16/BF16, which lets you fit much longer contexts in the same VRAM (see the size sketch at the end of this section). These are the GPUs I recommend:
| Model | RTX 5060 Ti | RTX 5070 Ti | RTX 5080 | RTX 5090 |
|---|---|---|---|---|
| Brand | ASUS | ASUS | GIGABYTE | ASUS |
| VRAM | 16 GB | 16 GB | 16 GB | 32 GB |
| VRAM Type | GDDR7 | GDDR7 | GDDR7 | GDDR7 |
| Memory Bandwidth | 448 GB/s | 896 GB/s | 960 GB/s | 1792 GB/s |
| Price (USD) | HERE | HERE | HERE | HERE |
I don’t list any GPUs with less than 16 GB of VRAM because they just aren’t worth the cost, and anything below 400 GB/s of memory bandwidth isn’t worth it either.
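To put the context-length advantage in numbers, here is a back-of-the-envelope KV cache size calculation. The model dimensions below are made-up, illustrative values, not the specs of the model benchmarked above; the key point is the bytes-per-element ratio (F16 stores 2 bytes per value, while ggml's Q8_0 stores 34 bytes per 32-value block, about 1.06 bytes per value):

```sh
# Hypothetical model dimensions (illustrative only, not the benchmarked model).
layers=48; kv_heads=8; head_dim=128; ctx=32768

# KV cache elements = 2 (K and V) * layers * kv_heads * head_dim * context.
elems=$(( 2 * layers * kv_heads * head_dim * ctx ))

# F16: 2 bytes/element. Q8_0: 34 bytes per 32-element block.
echo "F16  KV cache: $(( elems * 2 / 1024 / 1024 )) MiB"
echo "Q8_0 KV cache: $(( elems * 34 / 32 / 1024 / 1024 )) MiB"
```

With these numbers that works out to 6144 MiB for F16 vs 3264 MiB for Q8_0 at a 32K context, so for a fixed VRAM budget Q8_0 lets you run nearly twice the context.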