Oort Labs Team

AI agents in 2025: what actually happened vs the hype

2025 was supposed to be the year AI agents went mainstream. The investment was real, the ambition was real - but so was an 80% enterprise failure rate. Here's what the data actually shows.


AI Agents · Enterprise AI · AI Security · UK Tech

TL;DR

  • 2025 was marketed as the “Year of the AI Agent.” Experimentally, that’s true - 62% of organisations tried agents. But only 23% successfully scaled one.
  • US AI investment hit $109 billion. The UK invested $4.5 billion but took a more focused, sovereign approach centred on safety and defence.
  • Enterprise agent failure rates exceeded 80%, driven by legacy infrastructure collisions, hallucinated tool calls, and context window overload.
  • The security threat surface expanded dramatically. Memory poisoning, excessive agency, and prompt injection moved from theoretical to exploited.
  • Jobs didn’t disappear. Task composition shifted. The most valuable skill in 2025 turned out to be knowing how to manage agents, not just use them.

Every January, someone declares the year that’s just ended a watershed moment for AI. For 2025, the watershed was supposed to be autonomous agents - software that doesn’t just answer questions but plans, acts, and iterates without you in the loop.

The investment was real. The boardroom attention was real. But when you look at enterprise deployment data rather than conference keynotes, a more complicated picture emerges: extraordinary ambition running headlong into legacy infrastructure, half-baked architectures, and security models designed for a pre-agent world.

This is what actually happened.


The investment gulf: US capital saturation vs UK sovereign precision

Money shapes what gets built. In 2025, the money was overwhelmingly American.

US private AI investment hit $109.1 billion - nearly twelve times China’s $9.3 billion and roughly 24 times the UK’s $4.5 billion. That capital advantage translated directly into model output: US institutions produced 40 frontier models over the period; all of Europe produced three.

But raw capital isn’t the whole story. The UK made a deliberate choice not to compete on volume. Instead of spreading investment thin, the government concentrated it on asymmetric advantages: AI safety, quantum computing, and defence technology.

The UK’s bet is that being the world’s most credible evaluator of AI safety is worth more in the long run than being the third-largest model producer. That’s a reasonable bet. But it requires execution, and the gap between ambition and delivery has historically been a British speciality.

Regulatory postures also diverged. The US introduced 59 AI-related federal regulations in 2024 alone - more than double the prior year - and executive action created a national preemption framework to prevent state-level fragmentation. The UK’s AI Safety Institute (AISI) took a different route: rigorous empirical evaluation of commercial models, proactively uncovering critical vulnerabilities before deployment rather than legislating after the fact.

Public sentiment tracked these approaches. US optimism about AI’s societal benefits sat at 39%. UK optimism grew faster, up 8 percentage points, suggesting the population is more receptive when institutions visibly prioritise safety.


The valuation problem nobody wanted to discuss

Behind the optimism, the Bank of England’s Financial Policy Committee was quietly alarmed.

OpenAI reached a $500 billion valuation. Anthropic hit $170 billion - nearly triple its value from nine months earlier. The FPC warned explicitly that global equity markets appeared severely stretched and that the UK financial system, as an open economy, faced real spillover risk if AI expectations corrected sharply.

The academic data didn’t help. An MIT study found that up to 95% of organisations were achieving zero immediate financial return on their generative AI investments. The infrastructure spend - industrial-scale data centres, energy contracts, GPU clusters - was compounding faster than the productivity gains that were supposed to justify it.

This isn’t an argument that the investment was wrong. Infrastructure spending often looks bad before it looks good. But it does explain why 2025 felt simultaneously like unprecedented progress and a market straining under its own weight.


What “Year of the AI Agent” actually looked like

To understand how 2025 played out, it helps to be precise about what an AI agent actually is. It’s not a chatbot with a nicer interface. An agent can plan, reason, use external tools, call APIs, and execute multi-step workflows with little or no human supervision. The difference between a language model and an agent is the difference between a tool that advises and a tool that acts.

By that definition, 2025 delivered:

  • 88% of organisations reported regular AI use in at least one business function (up from 78% in 2024 and 55% in 2023)
  • 62% experimented with AI agents specifically
  • 23% successfully scaled an agent to production in at least one function

That last number is where the hype meets the data. Nearly two-thirds of organisations tried agents. Fewer than one in four got one into production at scale.

Adoption varied heavily by sector:

Sector                   AI adoption rate   Agentic scaling success
Technology / Telecom     94%                45%
Healthcare               89%                38%
Financial Services       87%                32%
Retail                   82%                28%
Manufacturing            79%                31%

Healthcare was the standout. Specialised AI tools were implemented at seven times the 2024 rate, driven by $1.4 billion in targeted investment. Ambient clinical documentation - agents that transcribe and structure clinical encounters in real time - generated $600 million alone, growing 2.4x year-over-year. Diagnostic assistance agents proved twice as accurate as humans on complex medical imaging tasks. This is what domain-specific, well-constrained agentic AI looks like when it works.


Why agents kept failing

The headline number is that enterprise AI project failure rates exceeded 80% in 2025 - twice the failure rate of traditional IT projects. 42% of companies abandoned the majority of their AI initiatives (up from 17% in 2024). 46% of proof-of-concept projects were scrapped before reaching production.

Understanding why requires looking at how agents actually break.

Context window overload

Enterprise teams frequently treated agent context windows as unfiltered data repositories - dumping entire Salesforce instances, internal wikis, and thousands of support tickets without semantic scoping. The result was “Lost in the Middle” failures: the agent retrieved the correct document but ignored it because it was buried under irrelevant noise.
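The alternative to dumping everything into the window is semantic scoping: score each candidate document against the task and pack only the most relevant ones into a fixed budget. A minimal sketch, with a deliberately crude word-overlap scorer standing in for a real embedding-based retriever (all function names here are hypothetical):

```python
# Sketch of semantic scoping before context assembly. Instead of dumping
# every document into the prompt, score each one against the task and keep
# only the best matches that fit a size budget.

def score(query: str, doc: str) -> float:
    """Crude relevance proxy: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def build_context(query: str, docs: list[str], budget_chars: int = 2000) -> str:
    """Pack the most relevant docs first, stopping at the budget."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    picked, used = [], 0
    for doc in ranked:
        if used + len(doc) > budget_chars:
            break
        picked.append(doc)
        used += len(doc)
    return "\n---\n".join(picked)
```

In production the scorer would be a vector similarity search, but the principle is the same: the budget and the ranking are explicit, so the correct document can't be buried under noise.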

Instruction drift

Due to attention decay in transformer architectures, system prompt constraints lose weight over long sessions. An agent told to “only output clean TypeScript” or “strictly follow GDPR protocols” would routinely drift from those constraints after 20–25 turns, particularly if the conversation provided conflicting examples. The engineering fix - “context pinning,” which re-injects critical constraints before every generation cycle - wasn’t obvious and required deliberate architectural discipline.
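Context pinning itself is simple once you know to do it. A minimal sketch, using the common role/content message convention (the constraint text and helper name are illustrative):

```python
# Minimal sketch of "context pinning": re-inject critical constraints before
# every generation cycle, so they don't decay over a long session.

PINNED = "Only output clean TypeScript. Strictly follow GDPR protocols."

def pinned_messages(history: list[dict]) -> list[dict]:
    """Return the conversation with the pinned constraints appended last,
    where attention is strongest, rather than only in the opening prompt."""
    return history + [{"role": "system", "content": PINNED}]
```

The point is that pinning happens on every call, not once at session start; the constraint is always the most recent thing the model sees.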

Hallucinated tool calls

This was the most dangerous new failure mode of 2025. Agents with access to tools don’t just hallucinate text - they hallucinate API arguments. An agent might query a database using user_id instead of the system’s required customer_uuid because the former appeared more frequently in its training data. The database returns zero rows. The agent reports a confident negative answer. Silent hallucination in a live system is orders of magnitude more dangerous than a chatbot producing a wrong sentence.
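The standard defence is to validate tool arguments against a declared schema before anything executes, turning a silent hallucinated call into a loud, retryable error. A minimal sketch (the tool registry and names are illustrative, not a real API):

```python
# Sketch of argument validation before tool execution. Rejecting unknown or
# missing parameters up front means a hallucinated argument like user_id
# (instead of the required customer_uuid) fails loudly instead of silently
# returning zero rows.

TOOLS = {
    "lookup_customer": {"required": {"customer_uuid"}, "optional": set()},
}

def validate_call(tool: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may execute."""
    spec = TOOLS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    problems = []
    allowed = spec["required"] | spec["optional"]
    for missing in spec["required"] - args.keys():
        problems.append(f"missing required argument: {missing}")
    for extra in args.keys() - allowed:
        problems.append(f"unexpected argument: {extra}")
    return problems
```

Feeding the problem list back to the agent as an error message usually lets it self-correct on the next attempt; what matters is that the invalid call never reaches the database.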

The “Polling Tax”

A financially painful discovery: agents waiting for external systems to process data would sometimes enter hyperactive query loops, hitting an API endpoint hundreds of times per second. The system returned 200 OK each time, so traditional monitoring never flagged it as an error. It ran completely undetected until financial audits revealed the runaway token costs. One team’s infrastructure bill tripled before anyone noticed.
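The fix is old-fashioned engineering: bounded polling with exponential backoff and a hard attempt cap, so a "still processing" response can never become a hyperactive loop. A minimal sketch (the `check_status` callable is a placeholder for the external call):

```python
import time

# Sketch of bounded polling with exponential backoff. The delay doubles each
# attempt up to a ceiling, and a hard attempt cap guarantees the loop ends.

def poll(check_status, max_attempts: int = 8, base_delay: float = 1.0,
         max_delay: float = 60.0, sleep=time.sleep):
    """Poll until check_status() is truthy or the attempt budget runs out."""
    delay = base_delay
    for attempt in range(max_attempts):
        if check_status():
            return True
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return False  # bounded failure: give up and surface it, don't loop forever
```

Eight capped attempts cost a handful of requests; an uncapped loop at hundreds of requests per second costs an audit finding.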

Legacy infrastructure collisions

The UK banking sector made this failure mode visceral. Nine major UK banks accumulated over 803 hours of unplanned outages in the reporting period. The problem: roughly 43% of banking systems still run on COBOL, handling 95% of ATM transactions and $3 trillion in daily global commerce. When an autonomous agent encountered an unhandled API schema change or timeout from a mainframe, it didn’t crash cleanly. It tried to reason around the error - sometimes hallucinating a success state, sometimes launching recursive queries that functioned as accidental internal DDoS attacks. The TSB migration disaster of 2018 had already shown what happens when you stress-test this infrastructure. Attaching agents to it made the problem nonlinear.


The architectural response: from monoliths to swarms

The industry’s answer to chronic single-agent failure was structural: stop expecting one large model to do everything, and start treating agents like distributed microservices.

The patterns that matured in 2025:

Sequential pipelines - agent A’s structured output becomes agent B’s input. Deterministic, auditable, brittle only at defined interfaces.

Parallel execution - multiple specialist agents work simultaneously, with a central aggregator combining outputs. A user profile evaluated concurrently from financial, legal, and operational angles.

Review and critique - an Actor agent generates output; an independent Critic agent evaluates it against defined rubrics before anything reaches production. This is the closest thing to automated quality control the field has developed.

ReAct (Reason and Act) - the agent interleaves reasoning steps with tool calls, adapting dynamically based on API responses. Powerful for open-ended tasks; requires careful scoping to prevent runaway behaviour.

Swarm coordination - a routing agent receives a complex objective, decomposes it into sub-tasks, and delegates to ephemeral specialist agents. High scalability, high orchestration complexity.
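The review-and-critique pattern above can be sketched as a bounded loop: an Actor drafts, a Critic scores against a rubric, and nothing ships until the Critic passes it or the retry budget runs out. The `actor` and `critic` callables here stand in for model calls:

```python
# Sketch of the Actor/Critic pattern. The Critic returns (accepted, feedback);
# the feedback is fed back into the Actor's next attempt. A fixed round
# budget keeps failure bounded rather than open-ended.

def actor_critic(actor, critic, task: str, max_rounds: int = 3):
    """Return (output, accepted). If never accepted, the caller escalates."""
    feedback = ""
    draft = None
    for _ in range(max_rounds):
        draft = actor(task, feedback)
        accepted, feedback = critic(draft)
        if accepted:
            return draft, True
    return draft, False  # escalate to a human instead of shipping unreviewed
```

The same loop shape generalises: swap the Critic for a battery of deterministic checks and you have automated quality control with an auditable pass/fail trail.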

The shared infrastructure requirement across all of these is state management: agents need to persist context, coordinate without stepping on each other, and fail in bounded ways. Google Cloud expanded its Agent Engine for exactly this. Cloudflare’s Agents SDK separated the reasoning layer from the execution environment, providing persistent objects on edge infrastructure with built-in identity and concurrency control. The plumbing is finally starting to match the ambition.


Security: the threat surface that grew faster than the defences

When an agent can send emails, query databases, modify production files, and transfer funds, it is not a text tool. It is a privileged actor in your infrastructure. 2025 made this uncomfortably clear.

OWASP updated its Top 10 to specifically address LLM and agentic applications. The risks that matter most:

Prompt injection remains the foundational vulnerability. Malicious content in a retrieved document, a support ticket, or a web page can redirect an agent’s behaviour without the user ever interacting with it directly.

Excessive agency is the reason prompt injection is so dangerous. Organisations routinely granted agents broader permissions than any specific task required. A compromised agent with excessive agency can execute lateral movements across the entire network. The blast radius is as large as the permissions you handed out.

Memory poisoning emerged as the most insidious novel threat. Long-running agents maintain persistent memory. Attackers injected malicious instructions into data streams the agent would later retrieve - effectively planting sleeper cells in the agent’s own knowledge base.

Insecure plugin design shifted the attack surface from the model itself to the third-party tools it calls. A well-hardened LLM connected to an insecure calculator plugin is only as secure as the plugin.

Real-world breaches confirmed these weren’t theoretical concerns. The DeepLeak incident exposed a DeepSeek database containing over a million lines of log streams, granting full database access to attackers. The SAPwned vulnerability in SAP AI Core allowed attackers to cross tenant isolation boundaries and access proprietary customer cloud credentials. A critical NVIDIA container vulnerability (CVE-2024-0132) affected 35% of global cloud environments running GPU workloads.

The security architecture that started to work in 2025 combined several principles:

  • Short-lived, identity-first tokens instead of persistent API keys
  • All agent-generated code executed in ephemeral isolated containers
  • Independent deterministic guardrails - regex-based pattern matching operating outside the model - blocking prohibited outputs before execution
  • Explicit AI Bills of Materials (AI-BOM) to track shadow AI deployments that bypass IT oversight
  • Zero Trust applied specifically to agent traffic, not just human users
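The deterministic guardrail idea is the most concrete of these. A minimal sketch: regex checks that run entirely outside the model and block prohibited outputs before any tool executes (the patterns here are illustrative, not a complete policy):

```python
import re

# Sketch of a deterministic guardrail layer. These checks are plain pattern
# matching, not model inference, so a prompt-injected agent cannot talk its
# way past them.

BLOCKLIST = [
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),          # destructive SQL
    re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),  # card-number shape
]

def guardrail(output: str) -> bool:
    """Return True only if the output is allowed to proceed to execution."""
    return not any(p.search(output) for p in BLOCKLIST)
```

Because the layer is independent of the model, it keeps working even when the system prompt has been subverted - which is exactly the scenario it exists for.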

The NIST AI Risk Management Framework’s “Map, Measure, Manage” methodology gave organisations a workable structure. Those that applied it systematically had meaningfully fewer security incidents than those that treated agent security as an afterthought.


Workforce: the replacement that didn’t happen (but the transformation that did)

The most emotionally charged story of 2025 was the one that didn’t fully materialise: mass cognitive job displacement.

McKinsey’s 2025 Global AI Survey found that 32% of organisations expected their workforce to shrink by 3% or more due to AI. 43% expected no change. 13% expected headcount to increase. The actual US employment data, analysed across 33 months post-ChatGPT by the US Budget Lab, found no discernible aggregate disruption - the pace of change was consistent with prior technological transitions like the early internet.

That’s not to say nothing changed. What changed was what work looks like.

The UK AI Safety Institute’s productivity study found consistent evidence that AI slashes the time cost of routine, repetitive tasks. PwC’s 2025 Global AI Jobs Barometer, covering nearly a billion job advertisements across six continents, concluded that AI increases the value of human workers in almost every sector it enters - not by replacing expertise, but by making expertise go further.

The shift is from task executor to workflow manager. The workers extracting the most value from AI aren’t the ones using it as a glorified autocomplete. They’re the ones treating agents as a team they direct, reviewing outputs critically, providing context, catching errors, and escalating decisions that shouldn’t be automated. The skills that make a good manager - clear delegation, contextual framing, judgment about when to escalate - turn out to map almost perfectly onto effective agent orchestration. Teams with strong management-oriented workers got 75% more value from AI agents than those without.

The genuine problem isn’t displacement - it’s skills deficit. A 2025 UK Labour Market Survey found that 60% of respondents identified a gap in basic AI comprehension, up substantially over five years. 28% of UK organisations said technical skill shortages were actively preventing them from hitting business goals. Employees are frequently more ready than leadership assumes - many are already using shadow AI tools daily - but the institutional scaffolding for formal upskilling is lagging badly.


Key takeaways

2025 established the agentic paradigm. It did not deliver on the full promise - and that gap is useful information.

The organisations that succeeded shared a few characteristics: they deployed agents into well-bounded, domain-specific tasks rather than chasing general autonomy. They built deterministic policy layers that operated outside the model. They gave agents the minimum permissions needed for each task, not the maximum available. And they treated security as an architectural constraint from the start, not a feature to add later.

The organisations that failed, overwhelmingly, did the opposite: they bolted agents onto legacy infrastructure without modernising it, granted excessive permissions for simplicity, treated the system prompt as a security boundary, and deployed without adversarial testing.

The infrastructure for reliable autonomous execution - better state management, independent guardrails, structured multi-agent orchestration - improved substantially in 2025. The gap between ambition and execution narrowed. But it didn’t close. The year that delivers seamless, trustworthy autonomous enterprise AI is still ahead.


At Oort Labs, we help organisations design agentic systems that are secure by architecture - not by hope. If you’re navigating enterprise agent deployment, we’d like to talk.


FAQ

Was 2025 really the “Year of the AI Agent”?

Experimentally, yes. The majority of enterprises ran some form of agent pilot. But “year of the AI agent” implies production-grade deployment, and only 23% of organisations successfully scaled an agent to production. It was more accurately the year enterprise AI collectively discovered how hard production deployment actually is.

Why did so many AI agent projects fail?

The most common causes were: context window overload (feeding agents too much unstructured data), instruction drift (system prompt constraints degrading over long sessions), hallucinated tool calls (agents inventing API arguments), legacy infrastructure collisions, and insufficient security architecture. Most of these are engineering problems, not fundamental limitations of the technology.

What’s the difference between an AI agent and a regular LLM?

A language model generates text in response to input. An AI agent can plan, use tools, call external APIs, take actions in systems (send emails, query databases, modify files), and chain multiple steps together autonomously. The distinction matters enormously for security: an agent is a privileged actor in your infrastructure, not a text generator.

Did AI cause widespread job losses in 2025?

Aggregate employment data doesn’t support a mass displacement narrative. What happened instead was a shift in task composition: routine cognitive tasks compressed, while work requiring judgment, context-setting, and oversight of AI systems became more valuable. The bigger labour market problem was a skills deficit - 60% of UK workers identified a gap in basic AI comprehension - rather than job destruction.

What should organisations do differently when deploying agents?

Start with domain-specific, bounded use cases rather than general autonomy. Build a policy enforcement layer outside the model. Give agents minimum-viable permissions. Treat system prompts as configuration, not security. Run adversarial testing before production. Track every AI asset with an AI Bill of Materials. And assume agents will fail in novel ways - design for graceful, bounded failure rather than assuming the agent will handle edge cases correctly.