
Architecture

The Duality Principle

Every “god” in the pantheon exposes its capabilities through three transport layers. The tools are the fundamental unit — CLI, MCP, and HTTP are just different transports over the same tool registry.

A tool that reads a file works identically whether invoked by a human typing in the REPL, another AI agent via MCP, or a script via HTTP.

             Human users ──── CLI (terminal REPL)
                                    │
Other agents ──── MCP (JSON-RPC stdio) ──── Tool Registry ──── Filesystem
                                    │                          Shell
     Programs ──── HTTP (REST/SSE) ─┘                          Network
CLI / MCP / API Architecture

Why Three Interfaces?

Interface   Transport                  Consumer    Use Case
CLI         Terminal stdin/stdout      Humans      Interactive coding, debugging, exploration
MCP         JSON-RPC 2.0 over stdio    AI agents   Claude Desktop, other coding agents
HTTP        REST + SSE                 Programs    Automation, CI/CD, web UIs, scripts

The same tool registry powers all three. No capability is exclusive to any interface.

Gilgamesh Architecture

Package Map

gilgamesh/
├── main.go              Entry point, REPL, subcommand dispatch
├── agent/
│   ├── agent.go         Core loop: prompt → LLM → tool → repeat
│   │                    Run()/RunWithContext() for CLI, RunWithEvents() for HTTP
│   └── prompt.go        TDD-first system prompt (~300 tokens)
├── llm/
│   └── client.go        OpenAI-compatible streaming SSE client
├── tools/
│   ├── registry.go      Tool registration, dispatch, enumeration
│   ├── read.go          Read files (offset/limit, numbered lines)
│   ├── write.go         Create/overwrite files (auto-mkdir)
│   ├── edit.go          Find-and-replace (unique match required)
│   ├── bash.go          Shell execution (120s timeout, 10K cap)
│   ├── grep.go          Content search (regex, 50 match cap)
│   ├── glob.go          File pattern matching (100 file cap)
│   └── test.go          Multi-language test runner (Go, Python, Rust, Zig, Node)
├── mcp/
│   ├── protocol.go      JSON-RPC 2.0 + MCP types
│   └── server.go        Stdio MCP server
├── server/
│   └── server.go        HTTP API server
├── ui/                  Terminal UI (color, markdown, tables, gauges, errors, commands)
├── config/              JSON config loader, validation, env var overrides
├── context/             Project context + skills loader (7 built-in via go:embed)
├── memory/              Project-scoped persistent memory
├── hooks/               Pre/post tool execution hooks
├── session/             JSONL session logging + conversation history
└── cmd/bench/           Go model benchmark tool (6-stage pipeline)

Agent Loop

The core of gilgamesh is a loop that sends user input to a local LLM, processes tool calls, and feeds results back.

User Input
    │
    ▼
┌──────────┐
│  System  │  ~300 tokens base + project context
│  Prompt  │
└────┬─────┘
     │
     ▼
┌──────────┐     ┌──────────┐
│StreamChat│────▶│   LLM    │  local llama.cpp / OpenAI-compatible
│  (SSE)   │◀────│  Server  │
└────┬─────┘     └──────────┘
     │
     ├── Text content → print to terminal / emit event
     │
     └── Tool calls → for each:
             │
             ├── Pre-hooks (can block)
             ├── Registry.Execute(name, args)
             ├── Post-hooks (observe)
             ├── Session log
             └── Append result → loop back to LLM
                 (max 15 iterations, loop detection)
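
In Go terms, the loop above reduces to a sketch like the following (the types, function names, and hook placement here are illustrative, not the actual agent package API):

```go
package main

import "fmt"

// toolCall and llmReply are illustrative stand-ins for the real
// llm/agent types; the names are hypothetical.
type toolCall struct{ Name, Args string }

type llmReply struct {
	Content string
	Calls   []toolCall
}

// runLoop drives the prompt → LLM → tool → repeat cycle, capped at
// maxIters to mirror the 15-iteration limit described above.
func runLoop(llm func(history []string) llmReply, exec func(toolCall) string, input string) string {
	const maxIters = 15
	history := []string{input}
	for i := 0; i < maxIters; i++ {
		reply := llm(history)
		if len(reply.Calls) == 0 {
			return reply.Content // plain text: we're done
		}
		for _, c := range reply.Calls {
			// pre-hooks would run here and could block the call
			result := exec(c)
			// post-hooks observe; the session log records the call
			history = append(history, result) // feed result back to the LLM
		}
	}
	return "(stopped: iteration limit)"
}

func main() {
	// Fake LLM: first turn asks for a tool, second turn answers.
	turn := 0
	llm := func(h []string) llmReply {
		turn++
		if turn == 1 {
			return llmReply{Calls: []toolCall{{Name: "read", Args: `{"path":"go.mod"}`}}}
		}
		return llmReply{Content: fmt.Sprintf("done after %d messages", len(h))}
	}
	exec := func(c toolCall) string { return "result of " + c.Name }
	fmt.Println(runLoop(llm, exec, "what module is this?"))
}
```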

Token Budget

The token budget is the critical constraint for CPU inference: every token in the system prompt delays the first response.

Component              Tokens
System prompt          ~300
7 tool definitions     ~800
Project context        ~500 (capped)
Total overhead         ~1,600
Typical user message   50-200
First request          ~1,700-1,800

At ~160 tok/s prompt processing (Qwen3.5-2B Q4_K_M, 12 threads), the first response arrives in ~10 seconds. Subsequent turns benefit from KV cache.
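
The arithmetic behind that estimate, as a quick back-of-envelope check:

```go
package main

import "fmt"

func main() {
	// Overhead figures from the table above, plus a mid-range user message.
	prompt := 300 + 800 + 500 // system prompt + tool defs + project context
	first := prompt + 150     // ~1,750 tokens on the first request

	// At ~160 tok/s prompt processing, time to first token:
	ttft := float64(first) / 160.0
	fmt.Printf("first request: %d tokens, TTFT ≈ %.1fs\n", first, ttft)
}
```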

MCP Protocol Flow

Client                          Gilgamesh MCP Server
  │                                    │
  │──── initialize ───────────────────▶│
  │◀─── serverInfo + capabilities ─────│
  │                                    │
  │──── notifications/initialized ────▶│  (no response)
  │                                    │
  │──── tools/list ───────────────────▶│
  │◀─── 7 tools with inputSchema ──────│
  │                                    │
  │──── tools/call {name, args} ──────▶│
  │         │ pre-hooks → execute → post-hooks
  │◀─── {content: [{type:"text",...}]} │
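
Concretely, the handshake and a tool call look roughly like this on the wire (abridged; exact fields follow the MCP specification, and the protocol version shown is one example revision):

```
→ {"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","clientInfo":{"name":"claude-desktop","version":"1.0"},"capabilities":{}}}
← {"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2024-11-05","serverInfo":{"name":"gilgamesh","version":"0.6.0"},"capabilities":{"tools":{}}}}
→ {"jsonrpc":"2.0","method":"notifications/initialized"}
→ {"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"read","arguments":{"path":"main.go"}}}
← {"jsonrpc":"2.0","id":2,"result":{"content":[{"type":"text","text":"..."}]}}
```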

HTTP API Flow

GET  /api/health          → {"status":"ok","version":"0.6.0"}
GET  /api/tools           → [{name, description, parameters}, ...]
POST /api/tools/{name}    → {"result":"...", "elapsed":"42µs"}
POST /api/chat            → SSE stream of agent events:
                              data: {"type":"content","content":"..."}
                              data: {"type":"tool_call","tool":"read","args":{...}}
                              data: {"type":"tool_result","tool":"read","content":"..."}
                              data: {"type":"done"}
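
A client consuming this stream only needs to split out the data: payloads. A minimal Go sketch, assuming the event shape shown above (the real wire format may carry more fields):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// event mirrors the /api/chat stream items shown above; this field set
// is a guess at the minimal shape, not the full wire format.
type event struct {
	Type    string `json:"type"`
	Tool    string `json:"tool"`
	Content string `json:"content"`
}

// parseSSE collects the JSON payloads of "data:" lines from an SSE body,
// stopping at the terminal {"type":"done"} event.
func parseSSE(body string) ([]event, error) {
	var events []event
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data:") {
			continue // skip blank keep-alive lines and comments
		}
		var ev event
		if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "data:")), &ev); err != nil {
			return nil, err
		}
		events = append(events, ev)
		if ev.Type == "done" {
			break
		}
	}
	return events, sc.Err()
}

func main() {
	stream := "data: {\"type\":\"content\",\"content\":\"hi\"}\n\ndata: {\"type\":\"done\"}\n"
	evs, _ := parseSSE(stream)
	fmt.Println(len(evs), evs[0].Content) // 2 hi
}
```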

Benchmarking Infrastructure

Gilgamesh includes a pure Go benchmark suite (cmd/bench/) for trialing local models. It loads profiles from gilgamesh.json, integrates with llama-bench for raw inference metrics, and supports JSON output for historical tracking.

It measures six dimensions:

  1. Health check — endpoint latency
  2. Raw inference — llama-bench pp/tg tok/s (auto-detects binaries in local-ai/bin/)
  3. Minimal prompt — TTFT + generation speed via API
  4. Tool call — can the model emit valid tool calls?
  5. One-shot — end-to-end gilgamesh run response
  6. Edit task — full agent loop: create file + edit it

The bench tool supports -all (compare all profiles), -raw (raw llama-bench only), -json (machine-readable output), and -save (append results to a JSON log).

Results and ongoing findings are tracked in TRIALS.md. The quest: find the optimal model + quantization + inference parameters for a responsive, reliable, tool-calling agent running entirely on CPU.

Design Decisions

Zero External Dependencies

Gilgamesh uses Go stdlib only. No cobra, no glamour, no third-party HTTP frameworks. This keeps the binary small (~9.8MB), builds fast, and eliminates supply chain risk.

Streaming-First

All LLM responses are streamed token-by-token via SSE. The agent loop processes tool calls as they arrive, providing perceived responsiveness even on slow CPU inference.

Closure-Based Tools

Each tool is a closure capturing its logic in an Execute func(args json.RawMessage) (string, error) field. This keeps the tool registry generic — any function that takes JSON and returns a string can be a tool.

Hooks for Extensibility

Instead of a plugin system (too complex, too many tokens), hooks run shell commands before/after tool execution. Users get full control over tool behavior without touching agent code.
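
A pre-hook can be as simple as running a shell command and treating a non-zero exit as "block". A sketch, with hypothetical TOOL_NAME/TOOL_ARGS environment variables (the real hook contract may differ):

```go
package main

import (
	"fmt"
	"os/exec"
)

// preHook runs a shell command before a tool executes; a non-zero exit
// blocks the tool call. TOOL_NAME/TOOL_ARGS are hypothetical names.
func preHook(command, tool, args string) error {
	cmd := exec.Command("sh", "-c", command)
	cmd.Env = append(cmd.Environ(),
		"TOOL_NAME="+tool, // the hook script can inspect these
		"TOOL_ARGS="+args,
	)
	return cmd.Run() // non-nil error (non-zero exit) blocks the call
}

func main() {
	// Block any bash tool call whose args mention "rm -rf".
	hook := `case "$TOOL_ARGS" in *"rm -rf"*) exit 1;; esac`
	if err := preHook(hook, "bash", `{"command":"rm -rf /tmp/x"}`); err != nil {
		fmt.Println("blocked:", err)
	} else {
		fmt.Println("allowed")
	}
}
```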

Skills as Prompt Templates

Skills are markdown files with {{args}} placeholders. They’re injected as the user message, not as system prompt additions. This keeps the token overhead constant regardless of how many skills exist.
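
Substitution can be a single string replacement. A simplified sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// renderSkill substitutes {{args}} in a skill template and returns the
// text to inject as the user message (a simplified sketch).
func renderSkill(template, args string) string {
	return strings.ReplaceAll(template, "{{args}}", args)
}

func main() {
	skill := "Review the following code for bugs and style issues:\n\n{{args}}"
	fmt.Println(renderSkill(skill, "func add(a, b int) int { return a - b }"))
}
```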

Model Profiles

{
  "fast":    { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
  "default": { "name": "qwen3.5-2b", "endpoint": "http://127.0.0.1:8081/v1" },
  "heavy":   { "name": "qwen3.5-4b", "endpoint": "http://127.0.0.1:8080/v1" }
}

Switch models mid-session with /model heavy or specify on launch with -m fast.