Normally Qwen3.x (3.5 and 3.6) models have a default context limit of 262,144 tokens (256k). There are many scenarios where it is advantageous to increase this to around 300k or 400k. One primary use case is having the model ingest a large number of files (usually source code documents) before working on a problem.
Here are the parameters required for llama.cpp, with an example that increases the context window to 300k.
./llama-server -m model.gguf --port 8080 --host 0.0.0.0 \
--rope-scaling yarn --rope-scale 1.14441 --yarn-orig-ctx 262144 \
--override-kv qwen35.context_length=int:1000000 --ctx-size 300000
Note that you MUST change qwen35 to qwen35moe if you are running an MoE model such as 35B-A3B.
E.g.
--override-kv qwen35.context_length=int:1000000
--override-kv qwen35moe.context_length=int:1000000
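If you are not sure which architecture prefix your GGUF file uses, you can read it from the file's metadata. A minimal sketch, assuming the gguf Python package is installed (pip install gguf), which ships a gguf-dump utility; model.gguf is a placeholder path:

```shell
# Print the architecture key from the GGUF metadata; the value shown
# is the prefix to use in --override-kv (e.g. qwen35 vs qwen35moe).
gguf-dump model.gguf | grep general.architecture
```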
The rope scale is just target context / model default max context. For our goal of 300k context that’s:
300,000 / 262,144 ≈ 1.14441
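The same division works for any target: for example, a 400k target would give 400,000 / 262,144 ≈ 1.52588. A quick way to compute it on the command line (awk here is just a convenience, any calculator works):

```shell
# YaRN rope scale = desired context / model default max context
awk 'BEGIN { printf "%.5f\n", 300000 / 262144 }'
# prints 1.14441
```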
I left the override-kv at 1 million because it has no effect on its own (the actual KV-cache allocation is controlled by --ctx-size), so it’s safe to set it high.
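Once the server is up, you can sanity-check the configured context size. A sketch assuming a recent llama-server build exposing the /props endpoint, with jq installed to pull out the field:

```shell
# Ask the running server for its properties; the reported per-slot
# n_ctx should reflect the 300000 context configured at launch.
curl -s http://localhost:8080/props | jq '.default_generation_settings.n_ctx'
```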