Research

The Quest

Find the ultimate local coding setup: the best model + quantization + inference parameters for a responsive, reliable, tool-calling coding agent running entirely on CPU. No GPU required. No cloud required. Just raw compute powering intelligent programs.

Every optimization matters when you’re paying per-token in latency instead of dollars.

Hardware

All trials run on consumer-grade hardware — if it works here, it works anywhere.

Component   Spec
CPU         AMD EPYC-Rome 16-core @ 2.0GHz
RAM         30GB
GPU         None
Inference   llama.cpp (CPU-only, AVX2)

Methodology

Gilgamesh includes a pure Go benchmark suite (cmd/bench/) that tests models across 6 stages, from raw inference to full agent edit tasks:

  1. Health check — endpoint latency
  2. Raw inference — llama-bench prompt-processing (pp) and token-generation (tg) tok/s
  3. Minimal prompt — time to first token (TTFT) + generation speed with a tiny prompt
  4. Tool call — can the model emit valid tool calls?
  5. One-shot — end-to-end gilgamesh run with simple question
  6. Edit task — full agent loop: create file + edit it (write + edit tools)

Each trial follows a controlled protocol: stop all servers, wait for the CPU to go idle, start with documented parameters, and average multiple runs. See TRIAL_METHODOLOGY.md for the full protocol.


Models Tested

Qwen3.5-0.8B Q4_K_M REJECTED
  • Fastest raw inference (268 pp, 22.7 tg tok/s)
  • Frequent tool call loops (repeated identical calls)
  • Hallucinated tool names and arguments
  • Loop detection triggers constantly
  • Cannot reliably follow multi-step instructions
Qwen3.5-2B Q4_K_M DEFAULT
  • 172 tok/s prompt processing — fast enough for interactive use
  • 19.2 tok/s generation — readable streaming output
  • Tool calling works reliably (glob, read, write, edit, bash)
  • First response in ~3-8 seconds with 1,600-token overhead
  • Edit task usually passes, with occasional failures (SLM reliability)
  • 1.18GB disk, ~2.8GB RAM
Qwen3.5-4B Q4_K_M HEAVY
  • Quality ceiling — significantly better code generation
  • Fewer tool calls to complete tasks (2 vs 5-9 for 2B)
  • Edit task passes consistently — higher reliability
  • Same speed as Q8_0 (memory bandwidth bottleneck), saves 1.6GB disk
  • 2.54GB disk, ~4.5GB RAM
  • 2.4x slower first response vs 2B — use for quality-critical tasks
Qwen3.5-9B Q8_0 NOT WORTH IT
  • Same 2-tool-call efficiency as 4B
  • 40-70% slower than 4B for same results
  • 5.7 tok/s TG makes interactive use painful
  • 8.9GB disk, ~10GB RAM

Key Findings

The efficiency insight

The 4B model compensates for slower inference with better planning. It uses fewer tool calls to complete the same task. The net result: 4B edit tasks can actually be faster than 2B when 2B makes many attempts.

Metric            2B Q4_K_M     4B Q4_K_M   9B Q8_0
Minimal prompt    650-840ms     1.7-1.8s    2.2s
Tool call         3.1-3.4s      7.5s        12.5s
One-shot          1-8s          ~20s        42.3s
Edit task         34-146s       60-96s      132s
Edit tool calls   3-9           2           2
Edit reliability  PASS (flaky)  PASS        PASS
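Plugging the measured tool-call latencies and call counts into a back-of-envelope estimate shows the effect: the 4B's slower per-call latency is more than offset when the 2B burns many calls on retries. (The per-call figures are from the table above; total edit times also include generation, so this is only the tool-call portion.)

```go
package main

import "fmt"

// totalToolTime is a back-of-envelope estimate of time spent in
// tool-call rounds: number of calls times per-call latency in
// seconds. Figures below come from the benchmark table.
func totalToolTime(calls int, perCallSec float64) float64 {
	return float64(calls) * perCallSec
}

func main() {
	// 2B worst case: 9 tool calls at ~3.2s each.
	// 4B typical: 2 tool calls at 7.5s each.
	fmt.Printf("2B worst case: %.1fs\n", totalToolTime(9, 3.2)) // 28.8s
	fmt.Printf("4B typical:    %.1fs\n", totalToolTime(2, 7.5)) // 15.0s
}
```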

Token budget is everything

Competitors use 10,000-40,000 token system prompts — completely unusable on CPU. Gilgamesh keeps total overhead under 1,600 tokens:

Component         Tokens
System prompt     ~300
Tool definitions  ~800
Project context   ~500
Total             ~1,600

Every 1,000 extra tokens = ~5.8s added latency at 172 tok/s. Context compaction at 12K tokens keeps interactions responsive.
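The latency arithmetic is simple enough to sanity-check, assuming prompt processing dominates and using the 172 tok/s pp rate measured above for 2B Q4_K_M:

```go
package main

import "fmt"

// promptLatency estimates how long prompt processing takes for a
// given number of context tokens at a given prompt-processing
// speed in tok/s (172 tok/s is the measured 2B Q4_K_M rate).
func promptLatency(tokens int, ppTokPerSec float64) float64 {
	return float64(tokens) / ppTokPerSec
}

func main() {
	fmt.Printf("1,000 extra tokens:    %.1fs\n", promptLatency(1000, 172))  // ~5.8s
	fmt.Printf("1,600-token overhead:  %.1fs\n", promptLatency(1600, 172))  // ~9.3s
	fmt.Printf("40,000-token prompt:   %.0fs\n", promptLatency(40000, 172)) // ~233s
}
```

The last line is why a 40,000-token system prompt is unusable on CPU: nearly four minutes before the first token of the first response.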


Parameter Tuning

Thread Count 12 OPTIMAL
  • PP saturates at 12 threads — 16 gives only 0-3% more
  • TG degrades 22-30% at 16 threads (contention + NUMA overhead)
  • 12 threads is the sweet spot for both PP and TG on 16-core EPYC
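To find the sweet spot on different hardware, a llama-bench thread sweep works well; llama-bench accepts comma-separated values and runs each combination. The model path below is a placeholder:

```shell
# Sweep thread counts on one model; -p 512 sets the prompt-processing
# test size, -n 128 the token-generation test size.
./llama-bench -m models/qwen.gguf -t 8,12,16 -p 512 -n 128
```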
Context Length 16K SUFFICIENT
  • 65K ctx adds ~672MB RAM — gilgamesh never needs >12K (compaction threshold)
  • 16K ctx saves ~500MB while covering all practical agent usage
Batch Size 256 OPTIMAL
  • b=256 is 2-3% faster PP than b=512
  • b=512 regresses on both PP and TG (cache pressure)
  • b=32 is 20-25% slower — too small for efficient SIMD
KV Cache Quantization q4_0 SAVES 5-7%
  • q4_0 KV saves 145MB (2B) / 215MB (4B) with no quality degradation
  • Tool calling and edit tasks work identically to f16 KV
  • Dual-model serving with q4_0 KV: 6.3GB total (21% of 30GB RAM)

Optimal server configuration

--threads 12 --ctx-size 16384 --batch-size 256 -ctk q4_0 -ctv q4_0
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
--chat-template-kwargs '{"enable_thinking":false}'

Open Questions

Future Trials TODO
  • IQ4_XS / IQ3_M quants — smaller memory footprint, quality impact?
  • New model families — Phi-4, Gemma 3, others within CPU constraints
  • Multi-model routing — simple tasks to 2B, complex to 4B automatically
  • Speculative decoding — draft model (0.8B) + verify (4B)
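For the multi-model routing idea, a first cut could be a purely heuristic router: short, read-only questions go to the 2B, anything that looks like a multi-step edit goes to the 4B. This is a speculative sketch; the model names, keywords, and length threshold are all invented, and a real router would likely need measurement-driven tuning:

```go
package main

import (
	"fmt"
	"strings"
)

// pickModel routes a prompt to a model tier. Hypothetical heuristic:
// keyword match or long prompts -> 4B; everything else -> 2B.
func pickModel(prompt string) string {
	complexHints := []string{"refactor", "edit", "implement", "fix", "write"}
	lower := strings.ToLower(prompt)
	for _, kw := range complexHints {
		if strings.Contains(lower, kw) {
			return "qwen-4b"
		}
	}
	if len(strings.Fields(prompt)) > 50 {
		return "qwen-4b" // long prompts suggest multi-step work
	}
	return "qwen-2b"
}

func main() {
	fmt.Println(pickModel("what does this function do?"))        // qwen-2b
	fmt.Println(pickModel("refactor the parser to use a stack")) // qwen-4b
}
```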

Raw Data

Full benchmark results, detailed tables, and reproducibility instructions: