Running Agentic AI in Production
· ~8 min readChatGPT in a browser tab is stage one. An agent that reads your codebase, identifies a bug, researches the API docs, proposes a fix, and runs the tests — all without you typing a second prompt — is stage three. Most organisations are still at stage one. Here's the architecture that gets you to stage three, built on real infrastructure running across nine hosts.
Three Stages of AI Integration
| Stage | What | Example |
|---|---|---|
| Chat | One prompt, one response | ChatGPT in the browser |
| Agent | Tool use, context-aware | OpenCode CLI |
| Multi-Agent | Orchestration, delegation | Oh-My-OpenAgent + ACP |
The jump from chat to agent is about giving the model tools — file access, terminal commands, web search. The jump from agent to multi-agent is about giving one agent the ability to spawn and coordinate other agents, each with its own model, context, and specialisation. That second jump is where the engineering gets interesting.
Production Requirements
Production AI isn't about prompt engineering. It's about infrastructure. Three requirements drive every architectural decision:
- Secure — Data sovereignty, GDPR compliance. Your code and prompts don't leave your network unless you decide they do.
- Controllable — Every agent decision is traceable. You can audit which model served which request, at what cost, with what latency.
- Integrable — APIs, existing systems, CI/CD pipelines. The AI layer doesn't exist in isolation.
What this looks like in practice: a local GPU cluster running Qwen3.5 397B and Gemma 4 26B for sovereign inference, a LiteLLM proxy presenting 25+ models through one endpoint, a Neo4j knowledge graph indexing four code repositories, and an ACP-based orchestration layer that can drive 90+ projects from a single CLI.
LiteLLM: The Model Abstraction Layer
Every provider has its own API surface. SAIA uses OpenAI-compatible endpoints with rate limits that reset at midnight. Z.ai serves GLM models through a Chinese CDN with different latency characteristics. Local GPU nodes have no rate limits but limited availability when the VPN drops. LiteLLM Proxy sits at http://127.0.0.1:4000/v1 and presents a unified OpenAI-compatible interface to all of them.
{
"provider": {
"litellm": {
"models": {
"glm-5-turbo": { "name": "GLM 5 Turbo (SAIA)" },
"saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" },
"qwen3.5-397b": { "name": "Qwen3.5 397B (local)" }
},
"options": {
"baseURL": "http://127.0.0.1:4000/v1",
"timeout": 120000,
"maxRetries": 5
}
}
}
}
```text
The proxy handles automatic fallback (SAIA goes down, traffic routes to local GPUs with zero code changes), unified logging (every API call recorded to PostgreSQL with cost, latency, and token counts), Redis caching (identical prompts computed once), and rate limiting (budget per model, per user).
### Agent-to-Model Mapping
Each agent category runs on its optimal model. The orchestrator doesn't know or care which provider serves the request:
| Category | Model | Why |
| ---------- | ------- | ----- |
| ultrabrain | GLM 5 Turbo | Hardest reasoning tasks |
| deep | GLM 4.7 | Complex implementation |
| visual-engineering | Qwen3.5 122B | UI/UX, styling, animation |
| quick | Gemma 4 26B | Trivial fixes (fast, free) |
| writing | GLM 5.1 | Documentation, articles |
Six more categories follow the same pattern. Model swap is a one-line change in the LiteLLM config — no agent code touched.
## OpenCode: The Agent Framework
[OpenCode](https://github.com/anomalyco/opencode) is a CLI agent that reads, writes, and tests code autonomously. It routes through LiteLLM for multi-model support, runs 30+ parallel background agents with context compaction to stay within token limits, and integrates LSP + AST-grep for real code understanding across 25 languages.
Three plugins extend OpenCode's capabilities:
- **Oh-My-OpenAgent** — Multi-agent orchestration (the subject of the next section)
- **DCP** — Context management and persistence
- **Morph** — Runtime configuration switching
The architecture connects models, MCP servers, plugins, and LSP in a single CLI process. Agents spawn as background tasks, each with isolated context, communicating results back to the orchestrator through structured output.
## Oh-My-OpenAgent: Ten Specialised Agents
The orchestrator (Sisyphus) delegates to specialised agents by role. Each agent runs on its optimal model via LiteLLM:
### The Brain: Orchestrator and Consultant Agents
| Agent | Role | Model | Why This Model |
| ------- | ------ | ------- | ---------------- |
| Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions |
| Prometheus | Planner | GLM 4.7 | Structured work breakdowns |
| Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis |
| Metis | Pre-planning | GLM 5 Turbo | Ambiguity and edge case detection |
| Momus | Reviewer | Qwen3.5 122B | Plan quality assurance |
### The Hands: Worker and Research Agents
| Agent | Role | Model | Why This Model |
| ------- | ------ | ------- | ---------------- |
| Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than orchestrator |
| Atlas | Worker | GLM 5 Turbo | Implementation execution |
| Librarian | Reference search | Qwen3.5 35B | External docs and API research |
| Explore | Code search | Gemma 4 26B | Fast local grep (cheap) |
| Multimodal | Vision | GLM 4.6V | Image analysis, screenshots |
### A Real Workflow
User: *"Fix the scheduler bug in the neo4jknowledgebase repo"*
1. **Sisyphus** (GLM 5 Turbo) — Detects bugfix intent, routes to investigation
2. **Explore** (Gemma 4 26B) — Finds scheduler code across the repository
3. **Librarian** (Qwen3.5 35B) — Researches the embedding API that's failing
4. **Oracle** (GPT OSS 120B) — Root cause analysis: model removed from config
5. **Sisyphus-Junior** (Devstral 2 123B) — Implements the fix
6. **Sisyphus** (GLM 5 Turbo) — Verifies build passes
Six different models, roughly fifteen API calls, all through one LiteLLM endpoint. Total cost: a few cents.
## Skills, Plugins, and MCP: Extensibility
MCP (Model Context Protocol) connects agents to external tools:
| MCP Server | Function | Transport |
| ------------ | ---------- | ----------- |
| Inkscape | SVG creation and manipulation | Local |
| CodeGraphContext | Code-indexed knowledge (Neo4j KG) | Local (SSH tunnel) |
| Playwright | Browser automation, screenshots | Built-in |
| Git | Repository operations | Built-in |
| File System | Read/write access to workspace | Built-in |
CodeGraphContext indexes four repositories (131 files, 736 functions) into Neo4j, giving agents queryable knowledge about their own codebase — class hierarchies, call chains, dead code detection.
Skills are reusable workflow templates that encode best practices:
| Skill | Description |
| ------- | ------------- |
| graphwiz-reporter | Autonomous: KG + RSS → research → published article |
| test-driven-development | Write tests first, then implement |
| systematic-debugging | Structured error analysis before proposing fixes |
| dispatching-parallel-agents | Distribute independent tasks in parallel |
| verification-before-completion | Evidence before assertions |
## ACP: The Agent Client Protocol
[ACP](/talks_and_thoughts/agent-client-protocol) (Agent Client Protocol, by Zed Industries) is the LSP moment for AI agents — a JSON-RPC standard for communication between clients and agents. One protocol, any editor, any agent.
```text
Client (Editor/CLI) → JSON-RPC (stdio/HTTP) → Agent (Claude/Codex/Gemini/OpenCode/...)
```text
As of April 2026: 30+ agents, 40+ clients. The practical value is multi-project orchestration:
```bash
# Batch prompt across all Python projects
acp run "check for outdated dependencies" --tags=python
# Autonomous improvement loop with session rotation
acp loop next-graphwiz-ai --max-iter 100 --rotate 25
```text
Session rotation every 25 iterations prevents context saturation — the agent starts fresh but picks up where it left off through persistent state.
## Infrastructure: Ansible Automation
Nine hosts, everything as code. Ansible playbooks handle the full stack:
| Playbook | Function |
| ---------- | ---------- |
| litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis |
| opencode-deploy | Build and install OpenCode from source (Go) |
| opencode-sync | Sync agent configs to all hosts |
| knowledge-graph | Neo4j + CodeGraphContext deployment |
| vpn-hub / vpn-peers | WireGuard mesh networking |
| traefik | Reverse proxy + TLS (Let's Encrypt) |
```bash
ansible-playbook site.yml --limit ai_cluster
```text
The full stack: Traefik terminates TLS and routes to the LiteLLM proxy, which fans out to SAIA, Z.ai, or local GPU nodes. OpenCode agents connect to LiteLLM over the internal network. MCP servers (Inkscape, CodeGraphContext) run as local processes. Everything is reproducible from a single `git pull && ansible-playbook`.
## Three Decoupling Layers
The architecture has three critical decoupling points:
| Layer | Tool | Decouples |
| ------- | ------ | ----------- |
| Models from agents | LiteLLM Proxy | Swap any model without touching agent code |
| Orchestration from implementation | Oh-My-OpenAgent | The planner and the worker are separate processes |
| Agents from editors | ACP | Run the same agent from VS Code, Zed, or the CLI |
Each layer can evolve independently. Upgrade your model roster? Change LiteLLM config. Add a new agent role? Add it to the orchestrator. Switch editors? Install a different ACP client. Nothing breaks because nothing is tightly coupled.
## Resources
- [SAIA Documentation](https://docs.hpc.gwdg.de/services/saia/index.html)
- [OpenCode](https://github.com/anomalyco/opencode)
- [Oh-My-OpenAgent](https://github.com/code-yeongyu/oh-my-openagent/)
- [LiteLLM Proxy](https://docs.litellm.ai/docs/proxy/proxy)
- [ACP Specification](https://agentclientprotocol.com)
- [CodeGraphContext](https://github.com/tobias-weiss-ai-xr/CodeGraphContext)
- [SAIA Plugin for OpenCode](https://codeberg.org/graphwiz-ai/opencode-saia-plugin)
- [Superpowers Skills](https://github.com/obra/superpowers)
- [MCP Protocol](https://modelcontextprotocol.io/)
*Based on a presentation delivered at the HRZ AI Colloquium, April 2026.*