Local Inference Stack with MiniMax M2.7 for Extraction, PyMC for Calibrated Probabilities
· ~5 min readWhen Microsoft Research documented that frontier AI agents corrupt documents across 20 interactions and Google's Threat Intelligence Group confirmed criminal hackers used AI to find and weaponise a real zero-day exploit, the response from the security community was mostly alarm without direction. The failure modes are clear. The architectural fix is less discussed.
This piece describes the engineering architecture that addresses both problems: a split design where the LLM handles signal extraction and a purpose-built probabilistic engine handles all Bayesian inference. The result is a local inference stack that produces calibrated posterior distributions rather than next-token predictions dressed up as probability estimates.
The Split Architecture
The core insight from Qiu et al. (2026) in Nature Communications is that LLMs plateau after first interaction when asked to update probabilistic beliefs. The DELEGATE-52 study from Microsoft Research shows the same failure mode in extended document workflows. Both findings point at one conclusion: do not ask an LLM to reason about probabilities over extended interactions.
The architectural fix is separation of concerns:
- LLM (MiniMax M2.7): Signal extraction from noisy, unstructured sources. Converts news feeds, forum posts, official announcements, and satellite imagery into structured observations.
- Probabilistic engine (PyMC): Exact Bayesian inference over those observations. NUTS sampling produces genuine posterior distributions over competing hypotheses.
The LLM never reasons about probabilities. It converts source material into a belief vector. PyMC processes that vector using Bayes rule and returns calibrated distributions.
Hardware: NVIDIA DGX Spark (GB10) at 10K USD
The hardware that makes this practical is NVIDIA's DGX Spark, available in a 2x stacked configuration for around ten thousand USD. The ProX PC benchmarks give concrete numbers:
| Metric | 70B+ Models | 7B Models |
|---|---|---|
| Generation speed | 4.6 tok/s | 46 tok/s |
| First-token latency | ~180s | ~18s |
| Total power draw | less than 100W | less than 100W |
| Unified memory | 128 GiB | 128 GiB |
The 180-second first-token latency for 70B models is a fixed prefill cost per ingestion cycle. For a geopolitical risk workflow that re-ingests 40 articles daily, you pay it once. The LLM generates at 4.6 tok/s while PyMC runs NUTS chains in the background—these are independent workloads that do not compete for resources.
The Workflow in Practice
For geopolitical risk modelling, tracking of ceasefire probabilities, sanctions escalation paths, and commodity flow disruption, the workflow uses these steps:
-
First: MiniMax M2.7 architecture with 200K context, MoE with 10B active parameters, and 62 layers processes source documents. The LLM extracts structured signals into a belief vector containing event type, actors involved, temporal markers, sentiment, and corroboration count.
-
Second: PyMC receives the belief vector. The probabilistic engine defines priors over competing hypotheses (ceasefire holds versus collapses, sanctions escalate versus plateau), conditions on the observed signals via likelihood functions, and runs NUTS sampling to compute posteriors.
-
Third: The posterior encodes calibrated probabilities rather than point estimates. Probability distributions for shipping lane closures within 90 days can be computed and updated incrementally as new signals arrive.
For commodity flow analysis including oil, rare earth elements, and semiconductor inputs, the posterior distribution over delivery disruption probabilities is exactly what risk desks need. A point estimate from an LLM that plateaued after the first observation is not a probability; it is a number that sounds like one.
Deployment Architecture
The practical deployment on 2x GB10:
- Node 1: MiniMax M2.7 via vLLM on the GB10 unified memory of 121 GiB. Prefill is a one-time cost per ingestion cycle.
- Node 2: PyMC as a FastAPI service on the GB10 ARM Neoverse cores for NUTS sampling. CPU-only compute since NUTS does not need a GPU.
- Inter-node link: 200G QSFP56 ConnectX-7 handles the signal payload between nodes.
Alternatively, both run on a single 2x GB10 stack with the LLM at reduced batch size to leave headroom for PyMC sampling chains. The power envelope is under 100W total.
Why This Changes the Security Posture
When inference runs locally on GB10 hardware, you are not sending raw source material to a third-party API. You are not trusting a frontier model next-token predictions as probability estimates for high-stakes decisions. You are running calibrated NUTS samplers that produce genuine posterior distributions.
The split architecture also means the LLM is a stateless signal extractor. There is no context accumulation across interactions, no catastrophic forgetting, no silent corruption over extended sequences. Each document ingestion is independent. The probabilistic state lives in the PyMC posterior, which is updated by Bayes rule—not by context window management.
For security teams, this is the relevant point: the architecture that makes AI agents unreliable for workflow automation (document corruption, plateau-after-first-interaction) is the same architecture that makes cloud-dependent LLM inference risky for high-stakes decisions. The fix is not to wait for better models. The fix is to change what you ask the LLM to do—and move probabilistic inference to a purpose-built engine running locally.
What Frontier Models Cannot Do
The Nature Communications finding that LLMs plateau after first interaction is a property of how these architectures update beliefs, not a bug that will be patched in the next generation. The DELEGATE-52 finding that adding tools to agents makes performance worse further constrains what agentic architectures can be trusted to do without supervision.
The 10K USD 2x DGX Spark setup addresses both failures directly. MiniMax M2.7 extracts signals from noisy sources. PyMC NUTS sampler maintains and updates calibrated probability distributions. The LLM is never asked to reason about beliefs over extended sequences. The probabilistic engine is never replaced by a next-token predictor.
The hardware is here. The split architecture is the right design. Whether organisations treat this as an engineering priority or a future consideration is the only remaining question.