The goal: run Qwen 3.5 27B on our local Windows machine with our GPU, serve it over an API, and have a Docker container run the aider benchmarks against that API. This way the benchmarks run in the Linux environment that Aider expects.
Here’s how we’ll do that.
Grab the latest compiled release of llama.cpp here
https://github.com/ggml-org/llama.cpp/releases
I used this build because I’m on Windows x64 with an NVIDIA GPU
https://github.com/ggml-org/llama.cpp/releases/download/b8407/llama-b8407-bin-win-cuda-13.1-x64.zip
Extract it to a folder.
You’ll also need the CUDA runtime DLLs; extract them into the same folder you extracted llama.cpp into.
https://github.com/ggml-org/llama.cpp/releases/download/b8407/cudart-llama-bin-win-cuda-13.1-x64.zip
Grab the model here:
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF
I used the UD-Q5_K_XL
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-UD-Q5_K_XL.gguf
You don’t necessarily need the model in the same folder; you can supply the full path to it, as I do below.
Now open a PowerShell window and cd into the folder where you extracted llama-server and the CUDA DLLs.
cd C:\Users\user1\Downloads\llama-b8407-bin-win-cuda-13.1-x64
.\llama-server.exe -m E:\lm-models\unsloth\Qwen3.5-27B-GGUF\Qwen3.5-27B-UD-Q5_K_XL.gguf --no-mmproj --no-mmap --jinja --threads 8 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --ctx-size 80000 -kvu --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --host 0.0.0.0
Caveats:
- I used .\ to run the command because this is PowerShell.
- I set the context size to 80k; you might need to lower yours if you run into VRAM issues. This is not a model you want to offload to RAM/CPU.
- I quantized the KV cache to Q8_0 because some benchmarks show it’s very safe; I never go lower than that.
- I used the sampling parameters (temp, top-p, etc.) that Qwen recommends for this model in thinking mode, as listed on the Hugging Face page.
- I set the host to 0.0.0.0 so that Docker can reach the server.
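Before involving Docker, it can help to sanity-check the server from the Windows side. Here’s a minimal Python sketch that talks to llama-server’s OpenAI-compatible chat endpoint; it assumes the server is listening on localhost:8080, and the helper names are my own, not part of llama.cpp or aider:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server's OpenAI-compatible API

def build_chat_request(prompt, model="Qwen3.5-27B-UD-Q5_K_XL.gguf"):
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt):
    """POST a chat completion and return the model's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask("Say hello in one word.")  # requires llama-server running locally
```

If this returns text, the server side is working and any later failures are on the Docker/aider side.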
Get Docker Desktop for Windows and install it
https://docs.docker.com/desktop/setup/install/windows-install/
Here’s the direct link for Windows x64
https://desktop.docker.com/win/main/amd64/Docker%20Desktop%20Installer.exe?utm_source=docker&utm_medium=webreferral&utm_campaign=docs-driven-download-win-amd64
Once you’ve installed Docker for Windows, move on to setting up the container and benchmarks
Note: I am working in a folder on my E drive called llm-benchmark; change this path for your setup.
cd E:\llm-benchmark
git clone https://github.com/Aider-AI/aider.git
cd aider
mkdir tmp.benchmarks
git clone https://github.com/Aider-AI/polyglot-benchmark tmp.benchmarks/polyglot-benchmark
cd E:\llm-benchmark\aider
docker run --rm -it -e AIDER_DOCKER=1 -e OPENAI_API_BASE=http://host.docker.internal:8080/v1 -e OPENAI_API_KEY=dummy --add-host=host.docker.internal:host-gateway -v "${PWD}:/aider" -w /aider aider-benchmark bash
Note: this assumes you have already built the aider-benchmark image; the aider repo ships a Dockerfile for the benchmark harness (see the benchmark directory and aider’s benchmarking docs for the build step).
That will put you in the Linux shell of the container
Now, test that you can reach the llama-server API.
curl http://host.docker.internal:8080/v1/models
You should get back a JSON response listing the available models.
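For reference, the response follows the OpenAI "list" format. The sample below is illustrative (the id on your machine reflects the loaded GGUF file), but parsing it works the same way:

```python
import json

# Illustrative /v1/models response in the OpenAI-compatible list format;
# the exact fields on your machine may differ.
sample = """
{"object": "list",
 "data": [{"id": "Qwen3.5-27B-UD-Q5_K_XL.gguf",
           "object": "model",
           "owned_by": "llamacpp"}]}
"""

def model_ids(raw):
    """Extract the model ids from a /v1/models JSON response."""
    return [m["id"] for m in json.loads(raw)["data"]]

print(model_ids(sample))  # ['Qwen3.5-27B-UD-Q5_K_XL.gguf']
```

The id in this list is what you pass to aider as `openai/<id>` in the next step.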
Next, test connecting to the model with Aider.
aider --model openai/Qwen3.5-27B-UD-Q5_K_XL.gguf
Type “test” or something similar and wait for a response.
If you got a response, press CTRL + C to exit.
Now run a benchmark smoke test inside the container.
./benchmark/benchmark.py smoke-test \
  --model openai/Qwen3.5-27B-UD-Q5_K_XL.gguf \
  --edit-format whole \
  --threads 1 \
  --num-tests 1 \
  --exercises-dir polyglot-benchmark
The benchmark may take a while to run.
If that fails with the error “/usr/bin/env: ‘python3\r’: No such file or directory”, it’s because you cloned the aider files on Windows (with CRLF line endings) before creating the container. Fix it with:
apt-get update && apt-get install -y dos2unix
dos2unix benchmark/benchmark.py
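For context on why dos2unix fixes this: git on Windows may check files out with CRLF line endings, so the script’s shebang line literally ends in a carriage return and the kernel looks for an interpreter named “python3\r”. A small Python illustration of what dos2unix is doing:

```python
# A script as git-for-Windows might check it out: CRLF line endings,
# so the shebang's interpreter name is effectively "python3\r".
crlf_script = "#!/usr/bin/env python3\r\nprint('hi')\r\n"

def to_unix(text):
    """What dos2unix does: convert CRLF line endings to LF."""
    return text.replace("\r\n", "\n")

fixed = to_unix(crlf_script)
assert "\r" not in fixed
print(fixed.splitlines()[0])  # #!/usr/bin/env python3
```

Configuring git to not convert line endings (core.autocrlf=false) before cloning avoids the problem entirely.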
Continue with the full benchmark. In this case I run 5 Rust tests.
./benchmark/benchmark.py my-local-run \
  --model openai/Qwen3.5-27B-UD-Q5_K_XL.gguf \
  --edit-format whole \
  --threads 1 \
  --keywords "rust" \
  --num-tests 5 \
  --exercises-dir polyglot-benchmark
This next part is optional.
Note that Aider runs with its own temperature setting, which you may not want for a local model; we want the sampling parameters set on llama-server (llama.cpp) to apply instead.
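As I understand it, setting use_temperature: false in the model settings makes Aider omit the temperature field from its API requests, so the server-side default (the --temp 0.6 we passed to llama-server) applies. A sketch of the difference (the helper below is hypothetical, not aider code):

```python
def chat_body(prompt, temperature=None):
    """OpenAI-style request body. Omitting 'temperature' lets the
    server-side sampling default apply; including it overrides the
    server's --temp for that request."""
    body = {
        "model": "Qwen3.5-27B-UD-Q5_K_XL.gguf",
        "messages": [{"role": "user", "content": prompt}],
    }
    if temperature is not None:
        body["temperature"] = temperature  # per-request override
    return body

assert "temperature" not in chat_body("hi")       # server default wins
assert chat_body("hi", 0.0)["temperature"] == 0.0  # client override
```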
Install nano so you can edit files
apt install nano -y
Create a YAML config file named “.custom-model-settings.yml”
nano .custom-model-settings.yml
Paste in the following
- name: openai/Qwen3.5-27B-UD-Q5_K_XL.gguf
  edit_format: whole
  weak_model_name: openai/Qwen3.5-27B-UD-Q5_K_XL.gguf
  use_repo_map: true
  use_temperature: false
Then press CTRL + X, confirm with Y, and press Enter to save and exit.
Now run the benchmark with the settings file specified.
./benchmark/benchmark.py UD-Q5_K_XL-KV-Q8-Q8 \ --model openai/Qwen3.5-27B-UD-Q5_K_XL.gguf \ --edit-format whole \ --threads 1 \ --keywords "rust" \ --num-tests 1 \ --read-model-settings .custom-model-settings.yml \ --exercises-dir polyglot-benchmark
Note: if you previously ran this benchmark, you may need to add --name at the end to overwrite it, or change the run name.
That’s it, you’re done!