Research

The Quest

Find the ultimate local coding setup: the best model + quantization + inference parameters for a responsive, reliable, tool-calling coding agent running entirely on CPU. No GPU required. No cloud required. Just raw compute powering intelligent programs.

Every optimization matters when you’re paying per-token in latency instead of dollars.

Hardware

All trials run on consumer-grade hardware — if it works here, it works anywhere.

Component   Spec
CPU         AMD EPYC-Rome 16-core @ 2.0GHz
RAM         30GB
GPU         None
Inference   llama.cpp (CPU-only, AVX2)

Methodology

Gilgamesh includes a pure Go benchmark suite (cmd/bench/) that tests models across 6 stages, from raw inference to full agent edit tasks:

  1. Health check — endpoint latency
  2. Raw inference — llama-bench prompt-processing (pp) and token-generation (tg) tok/s
  3. Minimal prompt — time to first token (TTFT) + generation speed with a tiny prompt
  4. Tool call — can the model emit valid tool calls?
  5. One-shot — end-to-end gilgamesh run with simple question
  6. Edit task — full agent loop: create file + edit it (write + edit tools)

Each trial follows a controlled protocol: stop all servers, wait for the CPU to go idle, start with documented parameters, and average multiple runs. See TRIAL_METHODOLOGY.md for the full protocol.


Models Tested

Qwen3.5-0.8B Q4_K_M REJECTED
  • Fastest raw inference (268 pp, 22.7 tg tok/s)
  • Frequent tool call loops (repeated identical calls)
  • Hallucinated tool names and arguments
  • Loop detection triggers constantly
  • Cannot reliably follow multi-step instructions
Qwen3.5-2B Q4_K_M DEFAULT
  • 172 tok/s prompt processing — fast enough for interactive use
  • 19.2 tok/s generation — readable streaming output
  • Tool calling works reliably (glob, read, write, edit, bash)
  • First response in ~3-8 seconds with 1,600-token overhead
  • Edit task usually passes, with occasional failures (SLM reliability)
  • 1.18GB disk, ~2.8GB RAM
Qwen3.5-4B Q4_K_M HEAVY
  • Quality ceiling — significantly better code generation
  • Fewer tool calls to complete tasks (2 vs 5-9 for 2B)
  • Edit task passes consistently — higher reliability
  • Same speed as Q8_0 (memory bandwidth bottleneck), saves 1.6GB disk
  • 2.54GB disk, ~4.5GB RAM
  • 2.4x slower first response vs 2B — use for quality-critical tasks
Qwen3.5-9B Q8_0 NOT WORTH IT
  • Same 2-tool-call efficiency as 4B
  • 40-70% slower than 4B for same results
  • 5.7 tok/s TG makes interactive use painful
  • 8.9GB disk, ~10GB RAM

Key Findings

The efficiency insight

The 4B model compensates for slower inference with better planning. It uses fewer tool calls to complete the same task. The net result: 4B edit tasks can actually be faster than 2B when 2B makes many attempts.

Metric            2B Q4_K_M     4B Q4_K_M   9B Q8_0
Minimal prompt    650-840ms     1.7-1.8s    2.2s
Tool call         3.1-3.4s      7.5s        12.5s
One-shot          1-8s          ~20s        42.3s
Edit task         34-146s       60-96s      132s
Edit tool calls   3-9           2           2
Edit reliability  PASS (flaky)  PASS        PASS
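Plugging the measured tool-call latencies and call counts into a back-of-envelope estimate shows the effect: the 4B's slower per-call latency is more than offset when the 2B burns many calls on retries. (The per-call figures are from the table above; total edit times also include generation, so this is only the tool-call portion.)

```go
package main

import "fmt"

// totalToolTime is a back-of-envelope estimate of time spent in
// tool-call rounds: number of calls times per-call latency in
// seconds. Figures below come from the benchmark table.
func totalToolTime(calls int, perCallSec float64) float64 {
	return float64(calls) * perCallSec
}

func main() {
	// 2B worst case: 9 tool calls at ~3.2s each.
	// 4B typical: 2 tool calls at 7.5s each.
	fmt.Printf("2B worst case: %.1fs\n", totalToolTime(9, 3.2)) // 28.8s
	fmt.Printf("4B typical:    %.1fs\n", totalToolTime(2, 7.5)) // 15.0s
}
```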

Token budget is everything

Competitors use 10,000-40,000 token system prompts — completely unusable on CPU. Gilgamesh keeps total overhead under 1,600 tokens:

Component         Tokens
System prompt     ~300
Tool definitions  ~800
Project context   ~500
Total             ~1,600

Every 1,000 extra tokens = ~5.8s added latency at 172 tok/s. Context compaction at 12K tokens keeps interactions responsive.
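The latency arithmetic is simple enough to sanity-check, assuming prompt processing dominates and using the 172 tok/s pp rate measured above for 2B Q4_K_M:

```go
package main

import "fmt"

// promptLatency estimates how long prompt processing takes for a
// given number of context tokens at a given prompt-processing
// speed in tok/s (172 tok/s is the measured 2B Q4_K_M rate).
func promptLatency(tokens int, ppTokPerSec float64) float64 {
	return float64(tokens) / ppTokPerSec
}

func main() {
	fmt.Printf("1,000 extra tokens:    %.1fs\n", promptLatency(1000, 172))  // ~5.8s
	fmt.Printf("1,600-token overhead:  %.1fs\n", promptLatency(1600, 172))  // ~9.3s
	fmt.Printf("40,000-token prompt:   %.0fs\n", promptLatency(40000, 172)) // ~233s
}
```

The last line is why a 40,000-token system prompt is unusable on CPU: nearly four minutes before the first token of the first response.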


Parameter Tuning

Thread Count 12 OPTIMAL
  • PP saturates at 12 threads — 16 gives only 0-3% more
  • TG degrades 22-30% at 16 threads (contention + NUMA overhead)
  • 12 threads is the sweet spot for both PP and TG on 16-core EPYC
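To find the sweet spot on different hardware, a llama-bench thread sweep works well; llama-bench accepts comma-separated values and runs each combination. The model path below is a placeholder:

```shell
# Sweep thread counts on one model; -p 512 sets the prompt-processing
# test size, -n 128 the token-generation test size.
./llama-bench -m models/qwen.gguf -t 8,12,16 -p 512 -n 128
```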
Context Length 16K SUFFICIENT
  • 65K ctx adds ~672MB RAM — gilgamesh never needs >12K (compaction threshold)
  • 16K ctx saves ~500MB while covering all practical agent usage
Batch Size 256 OPTIMAL
  • b=256 is 2-3% faster PP than b=512
  • b=512 regresses on both PP and TG (cache pressure)
  • b=32 is 20-25% slower — too small for efficient SIMD
KV Cache Quantization q4_0 SAVES 5-7%
  • q4_0 KV saves 145MB (2B) / 215MB (4B) with no quality degradation
  • Tool calling and edit tasks work identically to f16 KV
  • Dual-model serving with q4_0 KV: 6.3GB total (21% of 30GB RAM)

Optimal server configuration

--threads 12 --ctx-size 16384 --batch-size 256 -ctk q4_0 -ctv q4_0
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
--chat-template-kwargs '{"enable_thinking":false}'

Open Questions

Future Trials TODO
  • IQ4_XS / IQ3_M quants — smaller memory footprint, quality impact?
  • New model families — Phi-4, Gemma 3, others within CPU constraints
  • Multi-model routing — simple tasks to 2B, complex to 4B automatically
  • Speculative decoding — draft model (0.8B) + verify (4B)
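For the multi-model routing idea, a first cut could be a purely heuristic router: short, read-only questions go to the 2B, anything that looks like a multi-step edit goes to the 4B. This is a speculative sketch; the model names, keywords, and length threshold are all invented, and a real router would likely need measurement-driven tuning:

```go
package main

import (
	"fmt"
	"strings"
)

// pickModel routes a prompt to a model tier. Hypothetical heuristic:
// keyword match or long prompts -> 4B; everything else -> 2B.
func pickModel(prompt string) string {
	complexHints := []string{"refactor", "edit", "implement", "fix", "write"}
	lower := strings.ToLower(prompt)
	for _, kw := range complexHints {
		if strings.Contains(lower, kw) {
			return "qwen-4b"
		}
	}
	if len(strings.Fields(prompt)) > 50 {
		return "qwen-4b" // long prompts suggest multi-step work
	}
	return "qwen-2b"
}

func main() {
	fmt.Println(pickModel("what does this function do?"))        // qwen-2b
	fmt.Println(pickModel("refactor the parser to use a stack")) // qwen-4b
}
```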

Raw Data

Full benchmark results, detailed tables, and reproducibility instructions: