Architecture
The Duality Principle
Every “god” in the pantheon exposes its capabilities through three interfaces: CLI, MCP, and HTTP. Tools are the fundamental unit; the three interfaces are just different transports over the same tool registry.
A tool that reads a file works identically whether invoked by a human typing in the REPL, another AI agent via MCP, or a script via HTTP.
Human users  ──── CLI (terminal REPL) ────┐
                                          │
Other agents ──── MCP (JSON-RPC stdio) ───┼─── Tool Registry ──── Filesystem
                                          │                       Shell
Programs     ──── HTTP (REST/SSE) ────────┘                       Network
Why Three Interfaces?
| Interface | Transport | Consumer | Use Case |
|---|---|---|---|
| CLI | Terminal stdin/stdout | Humans | Interactive coding, debugging, exploration |
| MCP | JSON-RPC 2.0 over stdio | AI agents | Claude Desktop, other coding agents |
| HTTP | REST + SSE | Programs | Automation, CI/CD, web UIs, scripts |
The same tool registry powers all three. No capability is exclusive to any interface.
Gilgamesh Architecture
Package Map
gilgamesh/
├── main.go Entry point, REPL, subcommand dispatch
├── agent/
│ ├── agent.go Core loop: prompt → LLM → tool → repeat
│ │ Run()/RunWithContext() for CLI, RunWithEvents() for HTTP
│ └── prompt.go TDD-first system prompt (~300 tokens)
├── llm/
│ └── client.go OpenAI-compatible streaming SSE client
├── tools/
│ ├── registry.go Tool registration, dispatch, enumeration
│ ├── read.go Read files (offset/limit, numbered lines)
│ ├── write.go Create/overwrite files (auto-mkdir)
│ ├── edit.go Find-and-replace (unique match required)
│ ├── bash.go Shell execution (120s timeout, 10K cap)
│ ├── grep.go Content search (regex, 50 match cap)
│ ├── glob.go File pattern matching (100 file cap)
│ └── test.go Multi-language test runner (Go, Python, Rust, Zig, Node)
├── mcp/
│ ├── protocol.go JSON-RPC 2.0 + MCP types
│ └── server.go Stdio MCP server
├── server/
│ └── server.go HTTP API server
├── ui/ Terminal UI (color, markdown, tables, gauges, errors, commands)
├── config/ JSON config loader, validation, env var overrides
├── context/ Project context + skills loader (7 built-in via go:embed)
├── memory/ Project-scoped persistent memory
├── hooks/ Pre/post tool execution hooks
├── session/ JSONL session logging + conversation history
└── cmd/bench/ Go model benchmark tool (6-stage pipeline)
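The "unique match required" rule noted for edit.go can be illustrated with a small sketch. This is not the code in edit.go; the function name `replaceUnique` and the error messages are assumptions made for the example. The idea is that an edit is applied only when the search text occurs exactly once, so the model cannot silently change the wrong location.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// replaceUnique (hypothetical name) replaces old with replacement only when
// old occurs exactly once in content. Zero or several matches are errors.
func replaceUnique(content, old, replacement string) (string, error) {
	if old == "" {
		return "", errors.New("search string is empty")
	}
	switch n := strings.Count(content, old); n {
	case 0:
		return "", errors.New("no match found")
	case 1:
		return strings.Replace(content, old, replacement, 1), nil
	default:
		return "", fmt.Errorf("ambiguous edit: %d matches", n)
	}
}

func main() {
	out, err := replaceUnique("a := 1\nb := 2\n", "b := 2", "b := 3")
	fmt.Printf("%q %v\n", out, err)
}
```

Rejecting ambiguous matches pushes the model to include enough surrounding context in the search string to identify a single location.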
Agent Loop
The core of gilgamesh is a loop that sends user input to a local LLM, processes tool calls, and feeds results back.
User Input
│
▼
┌──────────┐
│ System │ ~300 tokens base + project context
│ Prompt │
└────┬─────┘
│
▼
┌──────────┐ ┌──────────┐
│ StreamChat│────▶│ LLM │ local llama.cpp / OpenAI-compatible
│ (SSE) │◀────│ Server │
└────┬─────┘ └──────────┘
│
├── Text content → print to terminal / emit event
│
└── Tool calls → for each:
│
├── Pre-hooks (can block)
├── Registry.Execute(name, args)
├── Post-hooks (observe)
├── Session log
└── Append result → loop back to LLM
(max 15 iterations, loop detection)
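The loop in the diagram can be sketched in a few lines of Go. This is a simplified illustration, not the code in agent.go: the `Reply` and `ToolCall` types, the `runLoop` signature, and the loop-detection rule (stop when the same call repeats back to back) are assumptions, and hooks and session logging are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolCall and Reply are hypothetical shapes used only in this sketch.
type ToolCall struct {
	Name string
	Args string
}

type Reply struct {
	Text  string
	Calls []ToolCall
}

const maxIterations = 15 // iteration cap taken from the diagram above

// runLoop asks the model, executes any requested tools, appends the results
// to the history, and repeats until the model answers without tool calls.
func runLoop(ask func(history []string) Reply, exec func(ToolCall) string, input string) (string, error) {
	history := []string{"user: " + input}
	lastCall := ""
	for i := 0; i < maxIterations; i++ {
		reply := ask(history)
		if len(reply.Calls) == 0 {
			return reply.Text, nil // plain text answer ends the loop
		}
		for _, c := range reply.Calls {
			key := c.Name + " " + c.Args
			if key == lastCall {
				// Assumed loop-detection rule: identical consecutive call.
				return "", errors.New("loop detected: repeated call " + key)
			}
			lastCall = key
			history = append(history, "tool "+c.Name+": "+exec(c))
		}
	}
	return "", fmt.Errorf("stopped after %d iterations", maxIterations)
}

func main() {
	// Fake model: request one read, then answer.
	ask := func(history []string) Reply {
		if len(history) == 1 {
			return Reply{Calls: []ToolCall{{Name: "read", Args: `{"path":"main.go"}`}}}
		}
		return Reply{Text: "summary based on " + history[len(history)-1]}
	}
	exec := func(c ToolCall) string { return "ok" }
	out, err := runLoop(ask, exec, "summarize main.go")
	fmt.Println(out, err)
}
```

Both stopping conditions matter with small local models, which are more likely to repeat a failing call than to give up on their own.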
Token Budget
Prompt size is the critical constraint for CPU inference: every token in the system prompt delays the first response.
| Component | Tokens |
|---|---|
| System prompt | ~300 |
| 7 tool definitions | ~800 |
| Project context | ~500 (capped) |
| Total overhead | ~1,600 |
| Typical user message | 50-200 |
| First request | ~1,700-1,800 |
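At ~160 tok/s prompt processing (Qwen3.5-2B Q4_K_M, 12 threads), a first request of ~1,700 tokens takes roughly 10 to 11 seconds to process before the first response token appears. Subsequent turns benefit from the KV cache, because the already-processed prefix does not have to be evaluated again.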
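The first-response estimate is simply prompt tokens divided by prompt-processing speed. The sketch below redoes that arithmetic with the figures from the table; the function name `prefillSeconds` is made up for the example, and it ignores generation time and server overhead.

```go
package main

import "fmt"

// prefillSeconds estimates how long the model spends reading the prompt
// before it can emit its first token.
func prefillSeconds(promptTokens int, tokensPerSecond float64) float64 {
	return float64(promptTokens) / tokensPerSecond
}

func main() {
	const (
		systemPrompt   = 300 // tokens, from the table above
		toolDefs       = 800
		projectContext = 500
		promptSpeed    = 160.0 // tok/s, the figure quoted above
	)
	overhead := systemPrompt + toolDefs + projectContext
	for _, userTokens := range []int{50, 200} {
		total := overhead + userTokens
		fmt.Printf("%d prompt tokens: about %.1f s before the first token\n",
			total, prefillSeconds(total, promptSpeed))
	}
}
```

The same division shows why trimming overhead matters: every 160 tokens removed from the fixed overhead saves about one second on the first turn at this speed.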
MCP Protocol Flow
Client Gilgamesh MCP Server
│ │
│──── initialize ───────────────────▶│
│◀─── serverInfo + capabilities ─────│
│ │
│──── notifications/initialized ────▶│ (no response)
│ │
│──── tools/list ───────────────────▶│
│◀─── 7 tools with inputSchema ─────│
│ │
│──── tools/call {name, args} ──────▶│
│ │ pre-hooks → execute → post-hooks
│◀─── {content: [{type:"text",...}]} │
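For readers who want to see the wire format, here is a minimal client-side sketch of the `tools/call` exchange. It assumes the standard JSON-RPC 2.0 envelope and the MCP convention that the call parameters are named `name` and `arguments` (the diagram abbreviates the latter as "args"). The Go type and function names are illustrative and are not taken from mcp/protocol.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request is a JSON-RPC 2.0 request envelope.
type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      int             `json:"id"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params,omitempty"`
}

// callParams carries the tool name and its arguments for tools/call.
type callParams struct {
	Name      string          `json:"name"`
	Arguments json.RawMessage `json:"arguments"`
}

// response covers only the result fields shown in the flow above.
type response struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Result  struct {
		Content []struct {
			Type string `json:"type"`
			Text string `json:"text"`
		} `json:"content"`
	} `json:"result"`
}

// newToolCall builds a tools/call request as a single JSON message.
func newToolCall(id int, tool string, args any) ([]byte, error) {
	rawArgs, err := json.Marshal(args)
	if err != nil {
		return nil, err
	}
	params, err := json.Marshal(callParams{Name: tool, Arguments: rawArgs})
	if err != nil {
		return nil, err
	}
	return json.Marshal(request{JSONRPC: "2.0", ID: id, Method: "tools/call", Params: params})
}

// firstText returns the first text item of a tools/call response.
func firstText(line []byte) (string, error) {
	var resp response
	if err := json.Unmarshal(line, &resp); err != nil {
		return "", err
	}
	for _, c := range resp.Result.Content {
		if c.Type == "text" {
			return c.Text, nil
		}
	}
	return "", fmt.Errorf("no text content in response %d", resp.ID)
}

func main() {
	req, _ := newToolCall(1, "read", map[string]string{"path": "main.go"})
	fmt.Println(string(req))
	text, err := firstText([]byte(`{"jsonrpc":"2.0","id":1,"result":{"content":[{"type":"text","text":"package main"}]}}`))
	fmt.Println(text, err)
}
```

Over the stdio transport each message is written as one line of JSON, so a client can pair `newToolCall` with a line-based reader on the server's stdout. The argument name `path` for the read tool is a guess; the actual schema comes from `tools/list`.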
HTTP API Flow
GET /api/health → {"status":"ok","version":"0.6.0"}
GET /api/tools → [{name, description, parameters}, ...]
POST /api/tools/{name} → {"result":"...", "elapsed":"42µs"}
POST /api/chat → SSE stream of agent events:
data: {"type":"content","content":"..."}
data: {"type":"tool_call","tool":"read","args":{...}}
data: {"type":"tool_result","tool":"read","content":"..."}
data: {"type":"done"}
Benchmarking Infrastructure
Gilgamesh includes a pure Go benchmark suite (cmd/bench/) for trialing local models. It loads profiles from gilgamesh.json, integrates with llama-bench for raw inference metrics, and supports JSON output for historical tracking.
It measures six dimensions:
- Health check: endpoint latency
- Raw inference: llama-bench pp/tg tok/s (auto-detects binaries in local-ai/bin/)
- Minimal prompt: TTFT + generation speed via API
- Tool call: can the model emit valid tool calls?
- One-shot: end-to-end gilgamesh run response
- Edit task: full agent loop: create file + edit it
Supports -all (compare all profiles), -raw (raw llama-bench), -json (machine-readable), -save (append to JSON log).
Results and ongoing findings are tracked in TRIALS.md. The quest: find the optimal model + quantization + inference parameters for a responsive, reliable, tool-calling agent running entirely on CPU.
Design Decisions
Zero External Dependencies
Gilgamesh uses Go stdlib only. No cobra, no glamour, no third-party HTTP frameworks. This keeps the binary small (~9.8 MB) and builds fast, and it removes third-party packages as a source of supply chain risk.
Streaming-First
All LLM responses are streamed token-by-token via SSE. The agent loop processes text and tool calls as they arrive, so the agent feels responsive even on slow CPU inference.
Closure-Based Tools
Each tool is a closure capturing its logic in an Execute func(args json.RawMessage) (string, error) field. This keeps the tool registry generic — any function that takes JSON and returns a string can be a tool.
Hooks for Extensibility
Instead of a plugin system (too complex, too many tokens), gilgamesh uses hooks: shell commands that run before and after tool execution. Pre-hooks can block a tool call and post-hooks can observe its result, so users can control tool behavior without touching agent code.
Skills as Prompt Templates
Skills are markdown files with {{args}} placeholders. They’re injected as the user message, not as system prompt additions. This keeps the token overhead constant regardless of how many skills exist.
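The substitution itself can be very small. In the sketch below, the `Message` type, the stand-in system prompt, and the function names are illustrative; only the `{{args}}` placeholder and the choice to send the rendered skill as the user message come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical chat message shape for this sketch.
type Message struct{ Role, Content string }

// systemPrompt is a stand-in for the real base prompt.
const systemPrompt = "You are a coding agent."

// renderSkill fills every {{args}} placeholder in a skill template.
func renderSkill(template, args string) string {
	return strings.ReplaceAll(template, "{{args}}", strings.TrimSpace(args))
}

// buildMessages keeps the system prompt fixed and carries the rendered
// skill as the user turn.
func buildMessages(template, args string) []Message {
	return []Message{
		{Role: "system", Content: systemPrompt},
		{Role: "user", Content: renderSkill(template, args)},
	}
}

func main() {
	skill := "Write a failing test for {{args}}, then make it pass."
	for _, m := range buildMessages(skill, " the config loader ") {
		fmt.Printf("%s: %s\n", m.Role, m.Content)
	}
}
```

Since the system message never changes, adding more skill files increases only the size of the one user message that invokes a skill.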
Model Profiles
{
"fast": { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
"default": { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
"heavy": { "name": "qwen3.5-4b", "endpoint": "http://127.0.0.1:8080/v1" }
}
Switch models mid-session with /model heavy or specify on launch with -m fast.
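Loading and selecting a profile could look like the sketch below. It treats the JSON object above as the whole input (where it sits inside gilgamesh.json is not shown here), and the fallback to `default` when no key is given is an assumption, as are the `Profile` and `resolve` names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Profile holds the two fields visible in the example above.
type Profile struct {
	Name     string `json:"name"`
	Endpoint string `json:"endpoint"`
}

// resolve parses the profile map and picks one entry by key. An empty key
// falls back to "default" (assumed behavior).
func resolve(raw []byte, key string) (Profile, error) {
	var profiles map[string]Profile
	if err := json.Unmarshal(raw, &profiles); err != nil {
		return Profile{}, fmt.Errorf("parse profiles: %w", err)
	}
	if key == "" {
		key = "default"
	}
	p, ok := profiles[key]
	if !ok {
		return Profile{}, fmt.Errorf("unknown model profile %q", key)
	}
	return p, nil
}

func main() {
	raw := []byte(`{
	  "fast":    { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
	  "default": { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
	  "heavy":   { "name": "qwen3.5-4b", "endpoint": "http://127.0.0.1:8080/v1" }
	}`)
	for _, key := range []string{"", "heavy", "huge"} {
		p, err := resolve(raw, key)
		fmt.Printf("%q -> %+v %v\n", key, p, err)
	}
}
```

Keeping the profile to a model name plus endpoint means switching models is just a change of target URL for the OpenAI-compatible client.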