Roadmap
Phase 1: Foundation (COMPLETE)
- Create godsfromthemachine GitHub organization
- Write vision document and project origins
- Create org profile repo with landing page
- Set up initial repos: gilgamesh, zeus, raijin, garuda
- Generate branding assets and visual identity
Phase 2: Gilgamesh v0.1 — Core Agent (COMPLETE)
- 7 built-in tools: read, write, edit, bash, grep, glob, test
- Streaming SSE client for OpenAI-compatible endpoints
- Interactive CLI REPL with slash commands
- One-shot mode (gilgamesh run "prompt")
- Multi-model profiles (fast/default/heavy)
- Skills system with argument substitution
- Hook system for pre/post tool execution
- Session logging (JSONL with distill)
- Loop detection and context compaction
- TDD-first system prompt
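A minimal sketch of how the built-in tools might hang together: a common interface plus a name-keyed registry the agent loop dispatches through. The interface and names here are illustrative assumptions, not gilgamesh's actual API.

```go
package main

import "fmt"

// Tool is a hypothetical shape for a built-in tool: a name plus a Run
// method taking string arguments (the real gilgamesh API may differ).
type Tool interface {
	Name() string
	Run(args map[string]string) (string, error)
}

// globTool is an illustrative stand-in for the "glob" built-in.
type globTool struct{}

func (globTool) Name() string { return "glob" }
func (globTool) Run(args map[string]string) (string, error) {
	return "matched: " + args["pattern"], nil
}

// registry maps tool names to implementations, as an agent loop might.
var registry = map[string]Tool{"glob": globTool{}}

// Dispatch looks up a tool by name and runs it, erroring on unknowns.
func Dispatch(name string, args map[string]string) (string, error) {
	t, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return t.Run(args)
}

func main() {
	out, _ := Dispatch("glob", map[string]string{"pattern": "*.go"})
	fmt.Println(out)
}
```

Keeping tools behind one interface is what lets the same seven implementations back the CLI, MCP, and HTTP surfaces in later phases.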
Phase 3: Gilgamesh v0.2 — CLI/MCP/API Duality (COMPLETE)
- MCP server mode (gilgamesh mcp) — JSON-RPC 2.0 over stdio
- HTTP API server mode (gilgamesh serve)
- REST endpoints: /api/health, /api/tools, /api/tools/{name}, /api/chat
- SSE streaming for /api/chat endpoint
- RunWithEvents() agent variant for non-terminal consumers
Phase 4: Gilgamesh v0.3 — Quality & Polish (COMPLETE)
- Unit tests for all 9 packages (agent, llm, tools, mcp, server, config, context, hooks, session)
- CI/CD with GitHub Actions (build + test on push/PR)
- Go benchmark tool (cmd/bench/) for model trialing
- Model trials documentation (TRIALS.md)
- Table-driven tests with edge cases for tools
- Graceful shutdown for HTTP server with request timeouts
- MCP protocol version negotiation (2025-03-26 + 2024-11-05)
- Version bump to v0.3.0
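The table-driven pattern mentioned above looks roughly like this; `trimTool` is a toy stand-in for a tool helper under test (in gilgamesh the cases would live in `_test.go` files run by `go test`).

```go
package main

import (
	"fmt"
	"strings"
)

// trimTool stands in for a tool helper under test; the table-driven
// pattern, not the function itself, is the point.
func trimTool(s string) string { return strings.TrimSpace(s) }

func main() {
	// A table of named cases, including edge cases, in the shape
	// Go's testing package encourages.
	cases := []struct{ name, in, want string }{
		{"plain", " hi ", "hi"},
		{"empty", "", ""},
		{"only spaces", "   ", ""},
		{"tabs and newlines", "\t hi \n", "hi"},
	}
	failed := 0
	for _, tc := range cases {
		if got := trimTool(tc.in); got != tc.want {
			failed++
			fmt.Printf("%s: got %q, want %q\n", tc.name, got, tc.want)
		}
	}
	fmt.Printf("%d/%d cases passed\n", len(cases)-failed, len(cases))
}
```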
Phase 5: Gilgamesh v0.5 — Skills, Permissions & Multi-language (COMPLETE)
- Configurable tool permissions (allowed_tools/denied_tools)
- Multi-language test support (Go, Python, Rust, Zig, Node.js)
- Shell completion (bash, zsh, fish)
- 7 built-in skills embedded in binary (commit, review, explain, fix, refactor, doc, tdd)
- 3-tier skill precedence: built-in → global → project-local
- Version bump to v0.5.0
- Memory/context persistence across sessions (.gilgamesh/memory.json)
- Project-scoped conversation history (save/resume)
- Custom tool registration (.gilgamesh/tools.json)
- TDD workflow automation (/tdd built-in skill)
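The permission keys (`allowed_tools`/`denied_tools`) come from the milestone above; the surrounding file shape is an assumption. A config might plausibly look like:

```json
{
  "allowed_tools": ["read", "grep", "glob", "test"],
  "denied_tools": ["bash", "write"]
}
```

A deny list alongside an allow list lets a project opt a sandboxed profile out of shell and filesystem writes without touching the binary.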
Phase 6: Gilgamesh v0.6 — UI & Developer Experience (COMPLETE)
- Command registry pattern replacing switch statement
- Graceful Ctrl+C via context.Context threading
- Streaming markdown renderer (headers, bold/italic, lists, blockquotes, code blocks)
- Accessible NoColor icon fallbacks
- Auto-sized table renderer and context pressure gauge
- Error classification with recovery hints
- Config validation and environment variable overrides
- Spinner with elapsed time display
- Terminal color profile detection (NO_COLOR, CLICOLOR, COLORTERM)
- 243 tests across 12 packages
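The command-registry pattern that replaced the switch statement can be sketched as a map of named handlers; the field names and commands here are illustrative, not gilgamesh's.

```go
package main

import "fmt"

// command is one REPL slash command. Registering commands in a map
// replaces a switch statement that grows with every new command.
type command struct {
	name, help string
	run        func(args []string) string
}

var commands = map[string]command{}

func register(c command) { commands[c.name] = c }

// dispatch finds and runs a command, with a uniform unknown-command path.
func dispatch(name string, args []string) string {
	c, ok := commands[name]
	if !ok {
		return fmt.Sprintf("unknown command /%s", name)
	}
	return c.run(args)
}

func main() {
	register(command{
		name: "help",
		help: "list commands",
		run: func([]string) string {
			return fmt.Sprintf("%d command(s) registered", len(commands))
		},
	})
	fmt.Println(dispatch("help", nil))
}
```

New commands become a `register` call instead of another switch arm, and the `help` metadata can be enumerated from the map for free.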
Phase 7: Gilgamesh v1.0 — Intelligence (FUTURE)
- Multi-model routing — automatic 2B/4B selection based on task complexity
- Distill workflow — session → skill extraction
- Trials methodology documentation (docs/TRIAL_METHODOLOGY.md)
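One plausible starting point for 2B/4B routing is a cheap heuristic over the prompt; the keywords and threshold below are invented for illustration, and a real router would likely score tasks more carefully.

```go
package main

import (
	"fmt"
	"strings"
)

// routeModel picks a model size from a rough task-complexity guess.
// Keywords and the length threshold are illustrative assumptions.
func routeModel(prompt string) string {
	complex := []string{"refactor", "design", "debug", "architecture"}
	lower := strings.ToLower(prompt)
	for _, kw := range complex {
		if strings.Contains(lower, kw) {
			return "4B" // heavy profile for harder tasks
		}
	}
	if len(prompt) > 500 {
		return "4B" // long prompts suggest more context to reason over
	}
	return "2B" // fast profile for simple edits and lookups
}

func main() {
	fmt.Println(routeModel("fix the typo in README"))
	fmt.Println(routeModel("refactor the agent loop"))
}
```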
Phase 8: Broader Gods (FUTURE)
- Zeus: complete Zig build system for llama.cpp
- New god projects: security tools, agentic browsers
- Cross-god integration via MCP/HTTP
- Shared tool library across gods
- Publishing and packaging (GitHub releases, Homebrew)
Running: Model Trialing & Benchmarking (ONGOING)
- Go benchmark tool (cmd/bench/main.go)
- Baseline benchmarks: Qwen3.5 2B Q4_K_M, 4B Q8_0
- Key findings documented (2B sweet spot, 0.8B rejected)
- Enhanced bench suite: config loading, raw llama-bench, JSON output, result persistence
- 4B Q4_K_M raw trial — same speed as Q8_0, saves 1.6GB disk
- Full -all comparison run: 2B vs 4B with summary table
- 4B Q4_K_M agent bench — outperforms Q8_0, recommend as heavy profile
- Thread tuning — 12 threads optimal on 16-core EPYC, 30% better token generation (TG)
- Context length impact — 16K ctx saves 672MB, sufficient for agent
- Batch size tuning — b=256 optimal, b=512 regresses
- 9B Q8_0 agent benchmarks — not worth it, same efficiency as 4B but slower
- KV cache quantization — q4_0 saves 5-7% RAM, no quality loss
- Trial IQ4_XS / IQ3_M quants — smaller memory footprint
- New model families — Phi-4, Gemma 3, others
- Speculative decoding — draft model (0.8B) + verify (4B)
- Multi-model routing — simple tasks → 2B, complex → 4B
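Pulling the tuning findings together, a raw trial run might be invoked roughly as below; the model filename is a placeholder, and flag spellings should be checked against the installed llama.cpp's `llama-bench --help`.

```shell
# Hypothetical llama-bench invocation reflecting the findings above:
# 12 threads, batch size 256, q4_0 KV cache quantization.
llama-bench -m qwen-4b-q4_k_m.gguf -t 12 -b 256 -ctk q4_0 -ctv q4_0
```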
Principles
Every milestone must satisfy:
- Local-first — all intelligence from local AI models
- Zero external deps — Go stdlib only for gilgamesh
- CLI + MCP + API — every capability through all three interfaces
- Lean — minimal token overhead for CPU inference
- TDD — gilgamesh eats its own dogfood with comprehensive tests