Research
The Quest
Find the ultimate local coding setup: the best model + quantization + inference parameters for a responsive, reliable, tool-calling coding agent running entirely on CPU. No GPU required. No cloud required. Just raw compute powering intelligent programs.
Every optimization matters when you’re paying per-token in latency instead of dollars.
Hardware
All trials run on consumer-grade hardware — if it works here, it works anywhere.
| Component | Spec |
|---|---|
| CPU | AMD EPYC-Rome 16-core @ 2.0GHz |
| RAM | 30GB |
| GPU | None |
| Inference | llama.cpp (CPU-only, AVX2) |
Methodology
Gilgamesh includes a pure Go benchmark suite (cmd/bench/) that tests models across 6 stages, from raw inference to full agent edit tasks:
- Health check — endpoint latency
- Raw inference — llama-bench pp/tg tok/s
- Minimal prompt — TTFT + generation speed with tiny prompt
- Tool call — can the model emit valid tool calls?
- One-shot — end-to-end gilgamesh run with a simple question
- Edit task — full agent loop: create a file + edit it (write + edit tools)
Each trial follows a controlled protocol: stop all servers, wait for clean CPU, start with documented parameters, average multiple runs. See TRIAL_METHODOLOGY.md for the full protocol.
Models Tested

Fastest candidate (rejected for reliability)
- Fastest raw inference (268 pp, 22.7 tg tok/s)
- Frequent tool-call loops (repeated identical calls)
- Hallucinated tool names and arguments
- Loop detection triggers constantly
- Cannot reliably follow multi-step instructions

2B Q4_K_M
- 172 tok/s prompt processing — fast enough for interactive use
- 19.2 tok/s generation — readable streaming output
- Tool calling works reliably (glob, read, write, edit, bash)
- First response in ~3-8 seconds with 1,600-token overhead
- Edit task passes but occasionally fails (SLM reliability)
- 1.18GB disk, ~2.8GB RAM

4B Q4_K_M
- Quality ceiling — significantly better code generation
- Fewer tool calls to complete tasks (2 vs 5-9 for the 2B)
- Edit task passes consistently — higher reliability
- Same speed as Q8_0 (memory-bandwidth bottleneck), saves 1.6GB disk
- 2.54GB disk, ~4.5GB RAM
- 2.4x slower first response vs the 2B — use for quality-critical tasks

9B Q8_0
- Same 2-tool-call efficiency as the 4B
- 40-70% slower than the 4B for the same results
- 5.7 tok/s TG makes interactive use painful
- 8.9GB disk, ~10GB RAM
Key Findings
The efficiency insight
The 4B model compensates for slower inference with better planning. It uses fewer tool calls to complete the same task. The net result: 4B edit tasks can actually be faster than 2B when 2B makes many attempts.
| Metric | 2B Q4_K_M | 4B Q4_K_M | 9B Q8_0 |
|---|---|---|---|
| Minimal prompt | 650-840ms | 1.7-1.8s | 2.2s |
| Tool call | 3.1-3.4s | 7.5s | 12.5s |
| One-shot | 1-8s | ~20s | 42.3s |
| Edit task | 34-146s | 60-96s | 132s |
| Edit tool calls | 3-9 | 2 | 2 |
| Edit reliability | PASS (flaky) | PASS | PASS |
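A crude way to see the efficiency effect is to model agent wall-clock time as tool calls × per-call latency, using the table's numbers. This ignores context growth and generation length between calls, so it undershoots the measured edit-task times, but the direction matches:

```go
package main

import "fmt"

// editTaskEstimate gives a rough wall-clock estimate for an agent edit
// task: tool-call round trips times per-call latency. It ignores prompt
// growth between calls, so treat it as a lower bound, not a prediction.
func editTaskEstimate(calls int, perCallSec float64) float64 {
	return float64(calls) * perCallSec
}

func main() {
	// Worst-case 2B run from the table: 9 calls at ~3.4s each.
	fmt.Printf("2B, 9 calls: %.0fs\n", editTaskEstimate(9, 3.4)) // ~31s
	// Typical 4B run: 2 calls at ~7.5s each.
	fmt.Printf("4B, 2 calls: %.0fs\n", editTaskEstimate(2, 7.5)) // 15s
}
```

Even though each 4B call is more than twice as slow, cutting the call count from 9 to 2 flips the total in its favor.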
Token budget is everything
Competitors use 10,000-40,000 token system prompts — completely unusable on CPU. Gilgamesh keeps total overhead under 1,600 tokens:
| Component | Tokens |
|---|---|
| System prompt | ~300 |
| Tool definitions | ~800 |
| Project context | ~500 |
| Total | ~1,600 |
Every 1,000 extra tokens = ~5.8s added latency at 172 tok/s. Context compaction at 12K tokens keeps interactions responsive.
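The arithmetic is worth making explicit. A tiny helper (illustrative, not part of the bench suite) converts prompt-token overhead into cold-prompt seconds at a given PP rate; real sessions fare better because llama.cpp caches repeated prompt prefixes:

```go
package main

import "fmt"

// promptLatency converts prompt-token overhead into added seconds at a
// given prompt-processing rate (tok/s), assuming a cold (uncached) prompt.
func promptLatency(tokens int, ppTokPerSec float64) float64 {
	return float64(tokens) / ppTokPerSec
}

func main() {
	// Gilgamesh's ~1,600-token overhead at 172 tok/s PP:
	fmt.Printf("1600 tokens:  %.1fs\n", promptLatency(1600, 172)) // ~9.3s
	// A 10,000-token competitor-style system prompt at the same rate:
	fmt.Printf("10000 tokens: %.1fs\n", promptLatency(10000, 172)) // ~58.1s
}
```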
Parameter Tuning

Threads
- PP saturates at 12 threads — 16 gives only 0-3% more
- TG degrades 22-30% at 16 threads (contention + NUMA overhead)
- 12 threads is the sweet spot for both PP and TG on the 16-core EPYC

Context size
- 65K ctx adds ~672MB RAM — gilgamesh never needs >12K (compaction threshold)
- 16K ctx saves ~500MB while covering all practical agent usage

Batch size
- b=256 is 2-3% faster PP than b=512
- b=512 regresses on both PP and TG (cache pressure)
- b=32 is 20-25% slower — too small for efficient SIMD

KV cache quantization
- q4_0 KV saves 145MB (2B) / 215MB (4B) with no quality degradation
- Tool calling and edit tasks work identically to f16 KV
- Dual-model serving with q4_0 KV: 6.3GB total (21% of 30GB RAM)
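The near-free savings follow from the storage format itself: ggml's q4_0 packs 32 values into an 18-byte block (16 data bytes plus a 2-byte f16 scale), i.e. 4.5 bits per value versus 16 bits for f16. A quick check of the ratio, assuming that standard block layout:

```go
package main

import "fmt"

// Bytes per element for the two KV-cache storage formats. q4_0 packs
// 32 values into an 18-byte block (16 data bytes + 2-byte f16 scale).
const (
	bytesPerElemF16  = 2.0
	bytesPerElemQ4_0 = 18.0 / 32.0 // 0.5625 bytes/value
)

func main() {
	saving := 1 - bytesPerElemQ4_0/bytesPerElemF16
	fmt.Printf("q4_0 KV cache uses %.0f%% less RAM than f16\n", saving*100) // 72%
}
```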
Optimal server configuration

```
--threads 12 --ctx-size 16384 --batch-size 256 -ctk q4_0 -ctv q4_0
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
--chat-template-kwargs '{"enable_thinking":false}'
```
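For programmatic launches, the same flags can be assembled in Go. This wrapper is a sketch, not part of Gilgamesh: it assumes llama-server is on PATH and takes a hypothetical model path:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// tunedArgs reproduces the optimal server configuration above so a
// supervisor process can launch llama-server with it.
func tunedArgs(modelPath string) []string {
	return []string{
		"--model", modelPath,
		"--threads", "12",
		"--ctx-size", "16384",
		"--batch-size", "256",
		"-ctk", "q4_0", "-ctv", "q4_0",
		"--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.0",
		"--chat-template-kwargs", `{"enable_thinking":false}`,
	}
}

func main() {
	// "model.gguf" is a placeholder; point this at a real GGUF file.
	cmd := exec.Command("llama-server", tunedArgs("model.gguf")...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Println("could not start llama-server:", err)
	}
}
```

Keeping the flags in one function makes it easy to launch the dual-model setup (2B + 4B on different ports) from the same tuned baseline.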
Open Questions
- IQ4_XS / IQ3_M quants — smaller memory footprint, quality impact?
- New model families — Phi-4, Gemma 3, others within CPU constraints
- Multi-model routing — simple tasks to 2B, complex to 4B automatically
- Speculative decoding — draft model (0.8B) + verify (4B)
Raw Data
Full benchmark results, detailed tables, and reproducibility instructions:
- TRIALS.md — complete trial data and findings
- TRIAL_METHODOLOGY.md — controlled benchmark protocol
- Benchmark suite — pure Go, run with `go run ./cmd/bench -all`