Running Agentic AI in Production | Thoughts & Talks

ChatGPT in a browser tab is stage one. An agent that reads your codebase, identifies a bug, researches the API docs, proposes a fix, and runs the tests — all without you typing a second prompt — is stage three. Most organisations are still at stage one. Here's the architecture that gets you to stage three, built on real infrastructure running across nine hosts.

Three Stages of AI Integration

Stage	What	Example
Chat	One prompt, one response	ChatGPT in the browser
Agent	Tool use, context-aware	OpenCode CLI
Multi-Agent	Orchestration, delegation	Oh-My-OpenAgent + ACP

The jump from chat to agent is about giving the model tools — file access, terminal commands, web search. The jump from agent to multi-agent is about giving one agent the ability to spawn and coordinate other agents, each with its own model, context, and specialisation. That second jump is where the engineering gets interesting.

Production Requirements

Production AI isn't about prompt engineering. It's about infrastructure. Three requirements drive every architectural decision:

Secure — Data sovereignty, GDPR compliance. Your code and prompts don't leave your network unless you decide they do.
Controllable — Every agent decision is traceable. You can audit which model served which request, at what cost, with what latency.
Integrable — APIs, existing systems, CI/CD pipelines. The AI layer doesn't exist in isolation.

What this looks like in practice: a local GPU cluster running Qwen3.5 397B and Gemma 4 26B for sovereign inference, a LiteLLM proxy presenting 25+ models through one endpoint, a Neo4j knowledge graph indexing four code repositories, and an ACP-based orchestration layer that can drive 90+ projects from a single CLI.

LiteLLM: The Model Abstraction Layer

Every provider has its own API surface. SAIA uses OpenAI-compatible endpoints with rate limits that reset at midnight. Z.ai serves GLM models through a Chinese CDN with different latency characteristics. Local GPU nodes have no rate limits but limited availability when the VPN drops. LiteLLM Proxy sits at http://127.0.0.1:4000/v1 and presents a unified OpenAI-compatible interface to all of them.

{
  "provider": {
    "litellm": {
      "models": {
        "glm-5-turbo": { "name": "GLM 5 Turbo (SAIA)" },
        "saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" },
        "qwen3.5-397b": { "name": "Qwen3.5 397B (local)" }
      },
      "options": {
        "baseURL": "http://127.0.0.1:4000/v1",
        "timeout": 120000,
        "maxRetries": 5
      }
    }
  }
}
```text

The proxy handles automatic fallback (SAIA goes down, traffic routes to local GPUs with zero code changes), unified logging (every API call recorded to PostgreSQL with cost, latency, and token counts), Redis caching (identical prompts computed once), and rate limiting (budget per model, per user).

### Agent-to-Model Mapping

Each agent category runs on its optimal model. The orchestrator doesn't know or care which provider serves the request:

| Category | Model | Why |
| ---------- | ------- | ----- |
| ultrabrain | GLM 5 Turbo | Hardest reasoning tasks |
| deep | GLM 4.7 | Complex implementation |
| visual-engineering | Qwen3.5 122B | UI/UX, styling, animation |
| quick | Gemma 4 26B | Trivial fixes (fast, free) |
| writing | GLM 5.1 | Documentation, articles |

Six more categories follow the same pattern. Model swap is a one-line change in the LiteLLM config — no agent code touched.

## OpenCode: The Agent Framework

[OpenCode](https://github.com/anomalyco/opencode) is a CLI agent that reads, writes, and tests code autonomously. It routes through LiteLLM for multi-model support, runs 30+ parallel background agents with context compaction to stay within token limits, and integrates LSP + AST-grep for real code understanding across 25 languages.

Three plugins extend OpenCode's capabilities:

- **Oh-My-OpenAgent** — Multi-agent orchestration (the subject of the next section)
- **DCP** — Context management and persistence
- **Morph** — Runtime configuration switching

The architecture connects models, MCP servers, plugins, and LSP in a single CLI process. Agents spawn as background tasks, each with isolated context, communicating results back to the orchestrator through structured output.

## Oh-My-OpenAgent: Ten Specialised Agents

The orchestrator (Sisyphus) delegates to specialised agents by role. Each agent runs on its optimal model via LiteLLM:

### The Brain: Orchestrator and Consultant Agents

| Agent | Role | Model | Why This Model |
| ------- | ------ | ------- | ---------------- |
| Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions |
| Prometheus | Planner | GLM 4.7 | Structured work breakdowns |
| Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis |
| Metis | Pre-planning | GLM 5 Turbo | Ambiguity and edge case detection |
| Momus | Reviewer | Qwen3.5 122B | Plan quality assurance |

### The Hands: Worker and Research Agents

| Agent | Role | Model | Why This Model |
| ------- | ------ | ------- | ---------------- |
| Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than orchestrator |
| Atlas | Worker | GLM 5 Turbo | Implementation execution |
| Librarian | Reference search | Qwen3.5 35B | External docs and API research |
| Explore | Code search | Gemma 4 26B | Fast local grep (cheap) |
| Multimodal | Vision | GLM 4.6V | Image analysis, screenshots |

### A Real Workflow

User: *"Fix the scheduler bug in the neo4jknowledgebase repo"*

1. **Sisyphus** (GLM 5 Turbo) — Detects bugfix intent, routes to investigation
2. **Explore** (Gemma 4 26B) — Finds scheduler code across the repository
3. **Librarian** (Qwen3.5 35B) — Researches the embedding API that's failing
4. **Oracle** (GPT OSS 120B) — Root cause analysis: model removed from config
5. **Sisyphus-Junior** (Devstral 2 123B) — Implements the fix
6. **Sisyphus** (GLM 5 Turbo) — Verifies build passes

Six different models, roughly fifteen API calls, all through one LiteLLM endpoint. Total cost: a few cents.

## Skills, Plugins, and MCP: Extensibility

MCP (Model Context Protocol) connects agents to external tools:

| MCP Server | Function | Transport |
| ------------ | ---------- | ----------- |
| Inkscape | SVG creation and manipulation | Local |
| CodeGraphContext | Code-indexed knowledge (Neo4j KG) | Local (SSH tunnel) |
| Playwright | Browser automation, screenshots | Built-in |
| Git | Repository operations | Built-in |
| File System | Read/write access to workspace | Built-in |

CodeGraphContext indexes four repositories (131 files, 736 functions) into Neo4j, giving agents queryable knowledge about their own codebase — class hierarchies, call chains, dead code detection.

Skills are reusable workflow templates that encode best practices:

| Skill | Description |
| ------- | ------------- |
| graphwiz-reporter | Autonomous: KG + RSS → research → published article |
| test-driven-development | Write tests first, then implement |
| systematic-debugging | Structured error analysis before proposing fixes |
| dispatching-parallel-agents | Distribute independent tasks in parallel |
| verification-before-completion | Evidence before assertions |

## ACP: The Agent Client Protocol

[ACP](/talks_and_thoughts/agent-client-protocol) (Agent Client Protocol, by Zed Industries) is the LSP moment for AI agents — a JSON-RPC standard for communication between clients and agents. One protocol, any editor, any agent.

```text
Client (Editor/CLI) → JSON-RPC (stdio/HTTP) → Agent (Claude/Codex/Gemini/OpenCode/...)
```text

As of April 2026: 30+ agents, 40+ clients. The practical value is multi-project orchestration:

```bash
# Batch prompt across all Python projects
acp run "check for outdated dependencies" --tags=python

# Autonomous improvement loop with session rotation
acp loop next-graphwiz-ai --max-iter 100 --rotate 25
```text

Session rotation every 25 iterations prevents context saturation — the agent starts fresh but picks up where it left off through persistent state.

## Infrastructure: Ansible Automation

Nine hosts, everything as code. Ansible playbooks handle the full stack:

| Playbook | Function |
| ---------- | ---------- |
| litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis |
| opencode-deploy | Build and install OpenCode from source (Go) |
| opencode-sync | Sync agent configs to all hosts |
| knowledge-graph | Neo4j + CodeGraphContext deployment |
| vpn-hub / vpn-peers | WireGuard mesh networking |
| traefik | Reverse proxy + TLS (Let's Encrypt) |

```bash
ansible-playbook site.yml --limit ai_cluster
```text

The full stack: Traefik terminates TLS and routes to the LiteLLM proxy, which fans out to SAIA, Z.ai, or local GPU nodes. OpenCode agents connect to LiteLLM over the internal network. MCP servers (Inkscape, CodeGraphContext) run as local processes. Everything is reproducible from a single `git pull && ansible-playbook`.

## Three Decoupling Layers

The architecture has three critical decoupling points:

| Layer | Tool | Decouples |
| ------- | ------ | ----------- |
| Models from agents | LiteLLM Proxy | Swap any model without touching agent code |
| Orchestration from implementation | Oh-My-OpenAgent | The planner and the worker are separate processes |
| Agents from editors | ACP | Run the same agent from VS Code, Zed, or the CLI |

Each layer can evolve independently. Upgrade your model roster? Change LiteLLM config. Add a new agent role? Add it to the orchestrator. Switch editors? Install a different ACP client. Nothing breaks because nothing is tightly coupled.

## Resources

- [SAIA Documentation](https://docs.hpc.gwdg.de/services/saia/index.html)
- [OpenCode](https://github.com/anomalyco/opencode)
- [Oh-My-OpenAgent](https://github.com/code-yeongyu/oh-my-openagent/)
- [LiteLLM Proxy](https://docs.litellm.ai/docs/proxy/proxy)
- [ACP Specification](https://agentclientprotocol.com)
- [CodeGraphContext](https://github.com/tobias-weiss-ai-xr/CodeGraphContext)
- [SAIA Plugin for OpenCode](https://codeberg.org/graphwiz-ai/opencode-saia-plugin)
- [Superpowers Skills](https://github.com/obra/superpowers)
- [MCP Protocol](https://modelcontextprotocol.io/)

*Based on a presentation delivered at the HRZ AI Colloquium, April 2026.*

Three Stages of AI Integration

Stage

What

Example

Chat

One prompt, one response

ChatGPT in the browser

Agent

Tool use, context-aware

OpenCode CLI

Multi-Agent

Orchestration, delegation

Oh-My-OpenAgent + ACP

Production Requirements

Production AI isn't about prompt engineering. It's about infrastructure. Three requirements drive every architectural decision:

Secure — Data sovereignty, GDPR compliance. Your code and prompts don't leave your network unless you decide they do.

Controllable — Every agent decision is traceable. You can audit which model served which request, at what cost, with what latency.

Integrable — APIs, existing systems, CI/CD pipelines. The AI layer doesn't exist in isolation.

LiteLLM: The Model Abstraction Layer

{ "provider": { "litellm": { "models": { "glm-5-turbo": { "name": "GLM 5 Turbo (SAIA)" }, "saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" }, "qwen3.5-397b": { "name": "Qwen3.5 397B (local)" } }, "options": { "baseURL": "http://127.0.0.1:4000/v1", "timeout": 120000, "maxRetries": 5 } } } } ```text The proxy handles automatic fallback (SAIA goes down, traffic routes to local GPUs with zero code changes), unified logging (every API call recorded to PostgreSQL with cost, latency, and token counts), Redis caching (identical prompts computed once), and rate limiting (budget per model, per user). ### Agent-to-Model Mapping Each agent category runs on its optimal model. The orchestrator doesn't know or care which provider serves the request: | Category | Model | Why | | ---------- | ------- | ----- | | ultrabrain | GLM 5 Turbo | Hardest reasoning tasks | | deep | GLM 4.7 | Complex implementation | | visual-engineering | Qwen3.5 122B | UI/UX, styling, animation | | quick | Gemma 4 26B | Trivial fixes (fast, free) | | writing | GLM 5.1 | Documentation, articles | Six more categories follow the same pattern. Model swap is a one-line change in the LiteLLM config — no agent code touched. ## OpenCode: The Agent Framework [OpenCode](https://github.com/anomalyco/opencode) is a CLI agent that reads, writes, and tests code autonomously. It routes through LiteLLM for multi-model support, runs 30+ parallel background agents with context compaction to stay within token limits, and integrates LSP + AST-grep for real code understanding across 25 languages. Three plugins extend OpenCode's capabilities: - **Oh-My-OpenAgent** — Multi-agent orchestration (the subject of the next section) - **DCP** — Context management and persistence - **Morph** — Runtime configuration switching The architecture connects models, MCP servers, plugins, and LSP in a single CLI process. Agents spawn as background tasks, each with isolated context, communicating results back to the orchestrator through structured output. ## Oh-My-OpenAgent: Ten Specialised Agents The orchestrator (Sisyphus) delegates to specialised agents by role. Each agent runs on its optimal model via LiteLLM: ### The Brain: Orchestrator and Consultant Agents | Agent | Role | Model | Why This Model | | ------- | ------ | ------- | ---------------- | | Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions | | Prometheus | Planner | GLM 4.7 | Structured work breakdowns | | Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis | | Metis | Pre-planning | GLM 5 Turbo | Ambiguity and edge case detection | | Momus | Reviewer | Qwen3.5 122B | Plan quality assurance | ### The Hands: Worker and Research Agents | Agent | Role | Model | Why This Model | | ------- | ------ | ------- | ---------------- | | Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than orchestrator | | Atlas | Worker | GLM 5 Turbo | Implementation execution | | Librarian | Reference search | Qwen3.5 35B | External docs and API research | | Explore | Code search | Gemma 4 26B | Fast local grep (cheap) | | Multimodal | Vision | GLM 4.6V | Image analysis, screenshots | ### A Real Workflow User: *"Fix the scheduler bug in the neo4jknowledgebase repo"* 1. **Sisyphus** (GLM 5 Turbo) — Detects bugfix intent, routes to investigation 2. **Explore** (Gemma 4 26B) — Finds scheduler code across the repository 3. **Librarian** (Qwen3.5 35B) — Researches the embedding API that's failing 4. **Oracle** (GPT OSS 120B) — Root cause analysis: model removed from config 5. **Sisyphus-Junior** (Devstral 2 123B) — Implements the fix 6. **Sisyphus** (GLM 5 Turbo) — Verifies build passes Six different models, roughly fifteen API calls, all through one LiteLLM endpoint. Total cost: a few cents. ## Skills, Plugins, and MCP: Extensibility MCP (Model Context Protocol) connects agents to external tools: | MCP Server | Function | Transport | | ------------ | ---------- | ----------- | | Inkscape | SVG creation and manipulation | Local | | CodeGraphContext | Code-indexed knowledge (Neo4j KG) | Local (SSH tunnel) | | Playwright | Browser automation, screenshots | Built-in | | Git | Repository operations | Built-in | | File System | Read/write access to workspace | Built-in | CodeGraphContext indexes four repositories (131 files, 736 functions) into Neo4j, giving agents queryable knowledge about their own codebase — class hierarchies, call chains, dead code detection. Skills are reusable workflow templates that encode best practices: | Skill | Description | | ------- | ------------- | | graphwiz-reporter | Autonomous: KG + RSS → research → published article | | test-driven-development | Write tests first, then implement | | systematic-debugging | Structured error analysis before proposing fixes | | dispatching-parallel-agents | Distribute independent tasks in parallel | | verification-before-completion | Evidence before assertions | ## ACP: The Agent Client Protocol [ACP](/talks_and_thoughts/agent-client-protocol) (Agent Client Protocol, by Zed Industries) is the LSP moment for AI agents — a JSON-RPC standard for communication between clients and agents. One protocol, any editor, any agent. ```text Client (Editor/CLI) → JSON-RPC (stdio/HTTP) → Agent (Claude/Codex/Gemini/OpenCode/...) ```text As of April 2026: 30+ agents, 40+ clients. The practical value is multi-project orchestration: ```bash # Batch prompt across all Python projects acp run "check for outdated dependencies" --tags=python # Autonomous improvement loop with session rotation acp loop next-graphwiz-ai --max-iter 100 --rotate 25 ```text Session rotation every 25 iterations prevents context saturation — the agent starts fresh but picks up where it left off through persistent state. ## Infrastructure: Ansible Automation Nine hosts, everything as code. Ansible playbooks handle the full stack: | Playbook | Function | | ---------- | ---------- | | litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis | | opencode-deploy | Build and install OpenCode from source (Go) | | opencode-sync | Sync agent configs to all hosts | | knowledge-graph | Neo4j + CodeGraphContext deployment | | vpn-hub / vpn-peers | WireGuard mesh networking | | traefik | Reverse proxy + TLS (Let's Encrypt) | ```bash ansible-playbook site.yml --limit ai_cluster ```text The full stack: Traefik terminates TLS and routes to the LiteLLM proxy, which fans out to SAIA, Z.ai, or local GPU nodes. OpenCode agents connect to LiteLLM over the internal network. MCP servers (Inkscape, CodeGraphContext) run as local processes. Everything is reproducible from a single `git pull && ansible-playbook`. ## Three Decoupling Layers The architecture has three critical decoupling points: | Layer | Tool | Decouples | | ------- | ------ | ----------- | | Models from agents | LiteLLM Proxy | Swap any model without touching agent code | | Orchestration from implementation | Oh-My-OpenAgent | The planner and the worker are separate processes | | Agents from editors | ACP | Run the same agent from VS Code, Zed, or the CLI | Each layer can evolve independently. Upgrade your model roster? Change LiteLLM config. Add a new agent role? Add it to the orchestrator. Switch editors? Install a different ACP client. Nothing breaks because nothing is tightly coupled. ## Resources - [SAIA Documentation](https://docs.hpc.gwdg.de/services/saia/index.html) - [OpenCode](https://github.com/anomalyco/opencode) - [Oh-My-OpenAgent](https://github.com/code-yeongyu/oh-my-openagent/) - [LiteLLM Proxy](https://docs.litellm.ai/docs/proxy/proxy) - [ACP Specification](https://agentclientprotocol.com) - [CodeGraphContext](https://github.com/tobias-weiss-ai-xr/CodeGraphContext) - [SAIA Plugin for OpenCode](https://codeberg.org/graphwiz-ai/opencode-saia-plugin) - [Superpowers Skills](https://github.com/obra/superpowers) - [MCP Protocol](https://modelcontextprotocol.io/) *Based on a presentation delivered at the HRZ AI Colloquium, April 2026.*