AI Agents Still Cannot Track Context — And Criminals Are Already Exploiting That
Microsoft Research published findings last week that should make any team betting its workflow automation on AI agents reconsider. The study, titled "LLMs Corrupt Your Documents When You Delegate", ran frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) through a 52-domain benchmark called DELEGATE-52. After 20 delegated interactions, these frontier models had lost an average of 25% of document content. Some domains saw 80% degradation. The sole domain where any model reached the "ready" threshold of 98% accuracy after 20 rounds: Python programming.
The week before, Google confirmed something more alarming: criminal hackers had used AI to discover and weaponise a previously unknown software flaw, a genuine zero-day vulnerability, for a planned mass hack campaign. This was not a red team exercise. This was real exploitation in the wild, and Google explicitly stated the actor had "likely leveraged an AI model" to find and weaponise the vulnerability.
These two data points are related. They describe the same gap from opposite ends of the threat landscape.
What DELEGATE-52 Actually Measured
The Microsoft Research team simulated professional workflows across 52 domains—accounting, crystallography, music notation, legal document handling—where an LLM acting as an agent receives a seed document, performs operations across multiple rounds, and must preserve fidelity. The benchmark is deliberately more demanding than sorting a spreadsheet.
The headline numbers are damning. Frontier models lost 25% of document content on average after 20 interactions; across all tested models, the average was 50%. The best performer, Google Gemini 3.1 Pro, was "ready" for only 11 of 52 domains. But the more interesting finding is how they failed.
In weaker models, degradation manifested as content deletion—words and sections simply disappearing. In frontier models, the failure mode was content corruption: plausible but incorrect content substituted silently. And critically, errors did not accumulate gradually. They arrived all at once: 10 to 30 percentage points lost in a single round-trip.
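To make the "all at once" pattern concrete, here is a minimal sketch of how a team might track fidelity across round trips in its own workflow and flag a sudden drop. It is not the DELEGATE-52 scoring method: the function names, the 10-point threshold, and the use of Python's difflib text similarity as a stand-in metric are all assumptions made for illustration. A character-level score of this kind is a crude proxy and can understate corruption that swaps in plausible but wrong content of similar shape.

```python
from difflib import SequenceMatcher


def fidelity(seed: str, current: str) -> float:
    """Similarity of the current document to the seed, as a percentage (0-100).

    Assumption: plain text similarity stands in for a real, domain-aware metric.
    """
    return 100.0 * SequenceMatcher(None, seed, current).ratio()


def sudden_drops(
    seed: str, versions: list[str], threshold: float = 10.0
) -> list[tuple[int, float]]:
    """Return (round_number, points_lost) for each round whose fidelity fell
    by at least `threshold` percentage points versus the previous round."""
    drops = []
    previous = 100.0  # the seed is identical to itself
    for round_number, text in enumerate(versions, start=1):
        score = fidelity(seed, text)
        if previous - score >= threshold:
            drops.append((round_number, previous - score))
        previous = score
    return drops


# Hypothetical usage: two clean round trips, then one that loses half the text.
seed = "alpha beta gamma delta epsilon zeta eta theta"
versions = [seed, seed, "alpha beta gamma delta"]
for round_number, lost in sudden_drops(seed, versions):
    print(f"round {round_number}: lost {lost:.1f} points")
```

Comparing each round against the previous score, rather than only checking the final document, is what surfaces a single bad round trip instead of averaging it away.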
The Microsoft researchers then equipped the same models with a basic agentic harness—file reading, writing, and code execution—and re-ran the benchmark. The results: "The four tested models perform worse when operated agentically with tools than without, incurring an additional 6% degradation."
An intern who corrupted a quarter of a document over a long workflow would be shown the door. Yet organisations are spending an average of 36% of their digital budgets on AI automation, according to Deloitte.
The Architecture Problem Is Not Fixable by Scale
The DELEGATE-52 findings align with separate research published in Nature Communications (Qiu et al., 2026, s41467-025-67998-6), which examined whether LLMs can update probabilistic beliefs across multiple interactions—the core requirement for any adaptive system. The answer is no. LLMs perform significantly worse than a normative Bayesian assistant, and most models plateau after the first interaction. Give them five rounds of the same user choice data and their recommendations do not improve beyond what a single round produces.
The Nature paper's finding is precise: "in contrast to the Bayesian Assistant, which gradually improves its recommendations as it receives additional information about the user's choices, LLMs' performance often plateaus after a single interaction, pointing to a limited ability to adapt to new information."
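For readers unfamiliar with what "gradually improves" means for a Bayesian assistant, the sketch below shows the mechanism in its simplest form. It is not the model or the task from the Nature Communications paper: the two hypotheses, the likelihood values, and the function name are invented for illustration.

```python
def bayes_update(
    prior: dict[str, float], likelihood: dict[str, float]
) -> dict[str, float]:
    """One Bayes step: the posterior is proportional to prior * likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Hypothetical setup: two candidate descriptions of what the user wants, and
# how likely each one makes the observed choice "picked the cheaper option".
prior = {"price_sensitive": 0.5, "time_sensitive": 0.5}
picked_cheaper = {"price_sensitive": 0.8, "time_sensitive": 0.3}

belief = prior
history = []
for _ in range(5):  # five rounds of the same observed choice
    belief = bayes_update(belief, picked_cheaper)
    history.append(belief["price_sensitive"])

print([round(p, 3) for p in history])
```

Each repeated observation moves the belief further in the same direction, so the fifth round leaves the assistant more certain than the first. That monotone improvement is the behaviour the quoted passage says LLMs often fail to show after the first interaction.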
This is not a bug that next year's model refresh will fix. It is a property of how these architectures handle belief update over extended interaction sequences. Both studies point at the same architectural limitation: the context window is not a working memory in the computational sense, and adding more parameters does not change that.
The Offensive Use Case Moved Faster
The Microsoft DELEGATE-52 study describes why enterprise AI agents are unreliable for workflow automation. The Google zero-day disclosure describes why those same capabilities, in adversarial hands, are already operational.
The gap between "AI agents fail at 20-step workflows in tests" and "AI-assisted zero-day exploits confirmed in the wild" is not comforting. It took the offensive use case considerably less time to become operational than it took the research community to document the failure mode in a benchmark.
Organisations betting on AI agents for automation should understand that the capability gap is architectural. The response is not to wait for frontier models to improve. The response is to change the architecture: separate the LLM from probabilistic inference, run inference locally, and treat the LLM as a signal extraction layer rather than a reasoning engine.
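One way to read that separation, sketched under heavy assumptions: the language model is reduced to a function that turns each message into a single structured signal, and all state and belief updating live in ordinary local code. The class and function names are hypothetical, the keyword matcher is only a stand-in for a real model call, and the Beta-Bernoulli update is one simple choice of inference, not a prescribed one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about the probability of a binary event."""

    alpha: float = 1.0  # uniform prior
    beta: float = 1.0

    def update(self, observed: bool) -> None:
        """Conjugate update: count one more success or one more failure."""
        if observed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def run_pipeline(
    messages: list[str], extract_signal: Callable[[str], bool]
) -> BetaBelief:
    """The extractor sees one message at a time and returns one boolean.
    It keeps no memory; the running belief is held entirely in BetaBelief."""
    belief = BetaBelief()
    for message in messages:
        belief.update(extract_signal(message))
    return belief


def keyword_extractor(message: str) -> bool:
    """Stand-in for an LLM call (hypothetical): does the message report a failure?"""
    return "failed" in message.lower()


# Hypothetical usage with four status messages.
messages = ["Job 1 failed", "Job 2 ok", "Job 3 FAILED", "Job 4 failed"]
belief = run_pipeline(messages, keyword_extractor)
print(f"estimated failure rate: {belief.mean:.2f}")
```

Because the extractor is stateless, a bad extraction corrupts one observation rather than the accumulated belief, and the inference step can be tested without any model in the loop.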
The hardware to do this exists. The software stack is open source. The question is whether organisations will treat this as an engineering problem or continue to paper over the architectural gap with marketing spend.