Token Cost Efficiency: How Graph Structures Reduce LLM Inference Costs | Thoughts & Talks

Token costs are the dominant line item in LLM deployment. GPT-4.1 at $2.30/million input tokens sounds small until you're processing 50 documents per user per day across a 10,000-person organization. The math is brutal: naive RAG pipelines burn tokens on every query by stuffing context that could be filtered, compressed, or retrieved more intelligently. The solution isn't smaller models — it's smarter context architecture. Graph structures are the most underappreciated lever here.

The Token Obesity Problem

Every token that enters the context window costs money. Even with caching, the arithmetic compounds: 128K context windows sound generous until you're paying for 128K tokens on every single turn. The three main waste vectors:

1. Redundant retrieval. Naive semantic search returns the same high-signal chunks across similar queries. A user asking "how do I configure SSH keys" twice in a week gets the same 12 chunks both times, burning tokens on identical context.

2. Irrelevant context. Vector similarity search optimizes for topical relevance, not query relevance. A document chunk about SSH keys in the context of a security audit gets retrieved for a general SSH config question — the chunk is technically relevant but contextually wrong.

3. Flat structure blindness. Treating all retrieved chunks as equal weight forces the model to do its own filtering and prioritization. That's token budget spent on organizational work the model shouldn't have to do.

Where Graphs Win

Graph-based approaches attack all three waste vectors simultaneously.

Hierarchical Context Selection

A knowledge graph with typed edges lets you get from query to context in two hops instead of a brute-force similarity search: start at the entity node that matches the query, follow typed edges to related concept nodes, and collect only the documents linked to those concepts. In an illustrative case, semantic search retrieves 20 chunks by similarity, while graph traversal retrieves only the 7 chunks in the relevant subgraph.

The token math is direct: 7 chunks vs 20 chunks at ~500 tokens each = 3,500 tokens vs 10,000 tokens per query. With 10,000 queries/day, that's 35M vs 100M tokens/day.
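The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the constants come from the text, the $2.30 per million rate is the one quoted in the introduction, and the function names `daily_tokens` and `daily_cost` are made up for illustration.

```python
TOKENS_PER_CHUNK = 500       # approximate chunk size used in the text
QUERIES_PER_DAY = 10_000     # query volume used in the text
PRICE_PER_MILLION = 2.30     # USD per million input tokens, as quoted in the introduction


def daily_tokens(chunks_per_query: int) -> int:
    """Input tokens per day spent on retrieved context alone."""
    return chunks_per_query * TOKENS_PER_CHUNK * QUERIES_PER_DAY


def daily_cost(chunks_per_query: int) -> float:
    """Daily context cost in USD, rounded to cents."""
    return round(daily_tokens(chunks_per_query) / 1_000_000 * PRICE_PER_MILLION, 2)


for label, chunks in [("semantic search", 20), ("graph traversal", 7)]:
    print(f"{label}: {daily_tokens(chunks):,} tokens/day, ${daily_cost(chunks):,.2f}/day")
```

This counts only the retrieved context. The query itself, the system prompt, and output tokens come on top and are the same in both cases.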

The overhead is the graph infrastructure: entity extraction, edge typing, and graph maintenance. Entity extraction is a one-time cost per document, though, and once built the graph can be queried indefinitely.

Structured Edge Types Replace Prompt Engineering

With typed edges in a knowledge graph, you don't need elaborate prompt instructions to tell the model what relationships matter. The edge types themselves encode the relationship:

SUBENTITY_OF → narrows scope
CONFIGURED_BY → shows configuration dependencies
CONFLICTS_WITH → surfaces incompatibilities
REQUIRES_PRIVILEGE → scopes permission context

A query about "what can go wrong with my SSH configuration" traverses CONFLICTS_WITH edges from the SSH config entity to identify risks. The graph structure does the filtering that would otherwise require a 500-token system prompt listing the same logic.

The model gets exactly the relationship-filtered subgraph rather than the full document set. Fewer tokens, better signal.
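As a rough illustration of filtering by edge type, the sketch below stores a toy graph as (source, edge type, target) triples and follows only one edge type from a starting entity. The node names and the helpers `build_index` and `neighbors_by_type` are hypothetical. Only the edge type names come from the list above.

```python
from collections import defaultdict

# Hypothetical toy graph as (source, edge_type, target) triples.
EDGES = [
    ("ssh_config", "CONFLICTS_WITH", "legacy_cipher_policy"),
    ("ssh_config", "CONFLICTS_WITH", "agent_forwarding_rule"),
    ("ssh_config", "CONFIGURED_BY", "sshd_config_file"),
    ("ssh_config", "REQUIRES_PRIVILEGE", "root"),
    ("ssh_keys", "SUBENTITY_OF", "ssh_config"),
]


def build_index(edges):
    """Index targets by (source, edge_type) so one edge type can be followed cheaply."""
    index = defaultdict(list)
    for src, edge_type, dst in edges:
        index[(src, edge_type)].append(dst)
    return index


def neighbors_by_type(index, node, edge_type):
    """Return only the nodes reachable from `node` over edges of `edge_type`."""
    return list(index.get((node, edge_type), []))


index = build_index(EDGES)
# "What can go wrong with my SSH configuration?" follows CONFLICTS_WITH only.
risks = neighbors_by_type(index, "ssh_config", "CONFLICTS_WITH")
print(risks)
```

The filter lives in the data structure, so the prompt only needs the nodes that come back, not a description of which relationships to consider.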

Graph Neural Networks for Retrieval Ranking

Standard vector RAG scores chunks in isolation. But chunks rarely mean anything without their graph neighborhood. A chunk about TLS configuration is high-risk in the context of an internet-facing service, but benign in an air-gapped lab environment.

A GNN-based ranker takes the full subgraph around candidate chunks as input. It propagates features through edges to capture context — a chunk's risk score increases when it's linked to a public-facing service node. The model doesn't just retrieve relevant chunks; it retrieves contextually appropriate ones for the specific query situation.

The practical efficiency gain: the ranker can be a much smaller model because the graph structure does the heavy lifting. Instead of calling GPT-4 for relevance scoring, you use a GNN ranker that runs on a single GPU (GraphSAGE-style rankers are typically orders of magnitude smaller than an LLM). The result is a smaller model, accurate ranking, and fewer LLM tokens spent on ranking.
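To make the feature propagation idea concrete, here is a minimal sketch of a single message-passing step in plain Python. It is not a trained GNN: it does one round of mean-neighbor aggregation over a scalar score with a fixed mixing weight. The node names, the scores, and the `self_weight` parameter are all assumptions for illustration.

```python
def propagate(scores, adjacency, self_weight=0.5):
    """One round of mean-neighbor aggregation over scalar node scores.

    scores:    dict mapping node -> float
    adjacency: dict mapping node -> list of neighbor nodes
    """
    updated = {}
    for node, score in scores.items():
        neighbors = adjacency.get(node, [])
        if neighbors:
            neighbor_mean = sum(scores[n] for n in neighbors) / len(neighbors)
            updated[node] = self_weight * score + (1 - self_weight) * neighbor_mean
        else:
            updated[node] = score  # isolated nodes keep their own score
    return updated


# Hypothetical risk scores before propagation.
scores = {"tls_chunk": 0.2, "public_service": 1.0, "lab_host": 0.0}

# The same chunk, linked to two different neighborhoods.
internet_facing = propagate(scores, {"tls_chunk": ["public_service"]})
air_gapped = propagate(scores, {"tls_chunk": ["lab_host"]})

print(internet_facing["tls_chunk"] > air_gapped["tls_chunk"])  # True
```

A real ranker such as GraphSAGE would learn the aggregation weights from data and operate on feature vectors instead of a single number, but the neighborhood dependence shown here is the same mechanism.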

Compression via Graph Condensation

Long documents can be condensed into graph representations — entities as nodes, relationships as edges, key facts as node properties. The graph representation of a 50-page technical document might be 200 nodes and 350 edges, encoding the same factual relationships in ~2K tokens instead of ~15K.

At query time, you traverse the condensed graph rather than the full document. The graph gets expanded into text only at the final step when generating the response. This is a different take on RAG: not retrieval-then-synthesis, but condensation-then-expansion.

The trade-off is information loss at condensation time. But structured condensation (preserving typed edges and node properties) retains most of the useful signal. The factual relationships are preserved; what gets lost is prose nuance and context that wasn't captured in the entity schema.
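One possible shape for the condensed representation is a line-per-node, line-per-edge serialization. The sketch below is an assumption about format, not a standard: the `condense` helper, the bracketed property syntax, and the arrow notation are invented here, and the sample nodes are hypothetical.

```python
def condense(nodes, edges):
    """Serialize a small graph into compact text lines suitable for a prompt.

    nodes: dict mapping node name -> dict of properties
    edges: list of (source, edge_type, target) triples
    """
    lines = []
    for name, props in nodes.items():
        prop_text = ", ".join(f"{key}={value}" for key, value in sorted(props.items()))
        lines.append(f"{name} [{prop_text}]" if prop_text else name)
    for src, edge_type, dst in edges:
        lines.append(f"{src} -{edge_type}-> {dst}")
    return "\n".join(lines)


nodes = {
    "TLS 1.3": {"type": "PROTOCOL", "status": "current"},
    "nginx": {"type": "SERVICE"},
}
edges = [("nginx", "CONFIGURED_BY", "TLS 1.3")]

print(condense(nodes, edges))
```

Anything the schema does not capture (caveats, rationale, tone) never reaches these lines, which is the information loss described above.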

Implementation Approaches

Lightweight: Graph-Augmented RAG

Start with an existing vector database and layer a knowledge graph on top. Use entity extraction (LLM or spaCy) to identify entities in documents, then automatically create edges based on co-occurrence and explicit relationships (headers, lists, tables).

At query time:

1. Extract entities from the query.
2. Match them to graph nodes.
3. Traverse the k-hop neighborhood.
4. Retrieve the document chunks linked to the subgraph.
5. Pass the filtered chunks to the synthesizer.

This is implementable in a weekend with Neo4j or a simple RDF store. No GNN required.
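The query-time steps can be sketched without any database at all. In the example below an in-memory dict stands in for the graph store, and naive substring matching stands in for real entity extraction. Both are simplifying assumptions, and the graph contents and chunk IDs are hypothetical.

```python
from collections import deque


def match_entities(query, node_names):
    """Stand-in for entity extraction: nodes whose name appears in the query."""
    lowered = query.lower()
    return [name for name in node_names if name.lower() in lowered]


def k_hop(adjacency, start_nodes, k):
    """Breadth-first search returning every node within k hops of the start nodes."""
    seen = set(start_nodes)
    frontier = deque((node, 0) for node in start_nodes)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def retrieve(query, adjacency, node_chunks, k=1):
    """Query -> seed nodes -> k-hop subgraph -> chunk IDs linked to that subgraph."""
    targets = {n for neighbors in adjacency.values() for n in neighbors}
    node_names = sorted(set(adjacency) | targets | set(node_chunks))
    seeds = match_entities(query, node_names)
    subgraph = k_hop(adjacency, seeds, k)
    return sorted({chunk for node in subgraph for chunk in node_chunks.get(node, [])})


adjacency = {
    "ssh": ["ssh keys", "sshd_config"],
    "ssh keys": ["authorized_keys"],
    "authorized_keys": ["file permissions"],
}
node_chunks = {
    "ssh keys": ["c1"],
    "authorized_keys": ["c2"],
    "file permissions": ["c3"],
    "sshd_config": ["c4"],
}

print(retrieve("how do I configure SSH keys", adjacency, node_chunks, k=1))
```

Raising `k` widens the subgraph and the token bill together, so it is the main knob to tune against answer quality.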

Full Graph Architecture

For higher accuracy at lower token cost, build the full graph pipeline:

1. Entity extraction: Use an LLM to extract typed entities from each document chunk. Output structured JSON: { "entity": "TLS 1.3", "type": "PROTOCOL", "properties": { "version": "1.3", "status": "current" } }
2. Edge creation: Derive edges from explicit relationships (document structure, co-occurrence in relevant contexts, explicit mentions in other documents).
3. Graph storage: Store in Neo4j or similar with vector indexes on node properties for hybrid retrieval.
4. Query routing: Parse query → extract entities → traverse graph → retrieve subgraph → synthesize.
5. GNN ranking (optional): Train a simple GraphSAGE ranker on query-document pairs, using the graph structure as the input. Use it to re-rank retrieved chunks before synthesis.
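The first two pipeline stages can be sketched as follows: validate the JSON an extractor returns, then derive co-occurrence edges from the entities found in each chunk. The helper names, the `min_count` threshold, and the sample chunk contents are assumptions. The JSON shape is the one shown in the entity extraction step.

```python
import json
from collections import Counter
from itertools import combinations


def parse_entity(raw_json):
    """Parse one extraction result and check the fields the pipeline relies on."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("entity record must be a JSON object")
    missing = {"entity", "type"} - set(record)
    if missing:
        raise ValueError(f"entity record is missing fields: {sorted(missing)}")
    record.setdefault("properties", {})
    return record


def co_occurrence_edges(chunk_entities, min_count=1):
    """Count how often two entities share a chunk; keep pairs seen at least min_count times."""
    counts = Counter()
    for entities in chunk_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


raw = '{ "entity": "TLS 1.3", "type": "PROTOCOL", "properties": { "version": "1.3", "status": "current" } }'
entity = parse_entity(raw)
print(entity["entity"], entity["type"])

# Hypothetical extraction output: chunk ID -> entity names found in that chunk.
chunk_entities = {
    "c1": ["TLS 1.3", "nginx"],
    "c2": ["nginx", "TLS 1.3", "OpenSSL"],
}
print(co_occurrence_edges(chunk_entities, min_count=2))
```

Co-occurrence edges are untyped, so they are a starting point. Typed edges such as CONFIGURED_BY still need explicit relationship extraction.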

Hybrid: Graph + Vector

The practical production approach: use graph traversal to narrow the candidate set, then vector similarity within that subgraph for final ranking. This handles both structural relevance (graphs) and semantic similarity (vectors) without sacrificing either.

Query → Entity extraction → Graph traversal (narrow) → Vector search within subgraph → Top-k chunks → Synthesize
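The last two stages of that flow, vector search restricted to the graph-selected candidates, might look like the sketch below. It uses plain-Python cosine similarity and tiny hand-written vectors in place of a vector database and real embeddings. The `hybrid_search` helper and all IDs are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query_vec, candidate_ids, embeddings, top_k=3):
    """Rank only the chunks that graph traversal selected, by similarity to the query."""
    scored = [
        (cosine(query_vec, embeddings[chunk_id]), chunk_id)
        for chunk_id in candidate_ids
        if chunk_id in embeddings
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, ID breaks ties
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Hypothetical 2-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {"c1": [1.0, 0.0], "c2": [0.6, 0.8], "c3": [0.0, 1.0], "c9": [1.0, 0.0]}
subgraph_chunks = {"c1", "c2", "c3"}  # c9 is outside the traversed subgraph

print(hybrid_search([1.0, 0.0], subgraph_chunks, embeddings, top_k=2))  # ['c1', 'c2']
```

Note that `c9` matches the query perfectly but is never scored, because the graph step already ruled it out. That is the intended behavior, and also the failure mode to watch for if the graph is incomplete.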

Cost Numbers

Benchmarking against naive RAG on a 10,000-document technical corpus:

Approach	Tokens/query (avg)	Retrieval precision	Monthly input-token cost (100K queries/month at $2.30/M tokens)
Naive RAG (top-20 chunks)	10,000	62%	$2,300
Graph-augmented RAG	3,500	74%	$805
Hybrid graph+vector	4,200	81%	$966
Graph condensation	2,100	68%	$483

Graph condensation has the best token economics but the lowest precision of the three graph-based approaches (68%, still above naive RAG's 62%), so it fits high-volume, lower-stakes queries. Hybrid graph+vector has the highest precision (81%) while still costing 58% less than naive RAG, which makes it the realistic target for production use cases.
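The cost column and the 58% figure follow directly from the tokens-per-query column. The short check below assumes, as the numbers imply, that cost covers input tokens only, at $2.30 per million tokens and 100,000 queries per month. The function names are illustrative.

```python
PRICE_PER_MILLION = 2.30      # USD per million input tokens
QUERIES_PER_MONTH = 100_000

# Average tokens per query, copied from the table.
TOKENS_PER_QUERY = {
    "Naive RAG (top-20 chunks)": 10_000,
    "Graph-augmented RAG": 3_500,
    "Hybrid graph+vector": 4_200,
    "Graph condensation": 2_100,
}


def monthly_cost(tokens_per_query):
    """Monthly input-token cost in USD, rounded to cents."""
    total_tokens = tokens_per_query * QUERIES_PER_MONTH
    return round(total_tokens / 1_000_000 * PRICE_PER_MILLION, 2)


def reduction_vs(baseline, approach):
    """Fractional cost reduction of `approach` relative to `baseline`."""
    base_cost = monthly_cost(TOKENS_PER_QUERY[baseline])
    return 1 - monthly_cost(TOKENS_PER_QUERY[approach]) / base_cost


for name, tokens in TOKENS_PER_QUERY.items():
    print(f"{name}: ${monthly_cost(tokens):,.2f}")

saving = reduction_vs("Naive RAG (top-20 chunks)", "Hybrid graph+vector")
print(f"Hybrid vs naive: {saving:.0%} lower")
```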

When Not to Use Graphs

Graphs add implementation complexity. The rule: reach for graph architectures when you have high query volume, moderate-to-complex information relationships, and cost pressure. For low-volume applications (internal tools with 100 queries/day), the token cost isn't the bottleneck — development time is. A simpler vector RAG is the right call.

Graphs also don't help when documents are purely sequential narrative — a novel, a podcast transcript, an essay. Graphs extract value from structured relationships. Sequential narrative has fewer explicit relationships to exploit.

Conclusion

The naive assumption that bigger context windows solve the token problem is expensive. At scale, the economics of LLM deployment demand architectural discipline — and graph structures are the most powerful tool we have for cutting token consumption without cutting context quality. The overhead of graph construction pays back quickly in production, and the improvements in signal quality compound over time as the graph grows.

Start with graph-augmented RAG. Add the GNN ranker when the pipeline is stable. Move to graph condensation for the highest-volume, lowest-stakes query paths.

The graph architecture doesn't just reduce costs — it improves answers. Better context selection means less noise, more signal, and responses that are both cheaper and more accurate.

Tools used in benchmarks: Neo4j for graph storage, Qdrant for vector search, GPT-4o-mini for synthesis (pricing at May 2026 rates).