Artificial Intelligence and agents in companies: what separates those that work from those that fail

What separates solutions that work from those that fail? Image: Taqtile

By Nix Lopes

Over the past two years, the technology industry has been moving software with AI agents from proofs of concept (POCs) that sell dreams into workflows for real, complex scenarios. High failure and frustration rates have accompanied this shift, as evidenced by the widely cited MIT study State of AI in Business 2025, analyzed in detail by our team in the Radar AI Taqtile 2025.

On the executive side, there is pressure for a return on high investment, along with high expectations of end-to-end process automation. On the technical side, there is a growing view that AI-agent software that reaches production fails not because of technological limitations, but because of a lack of the engineering needed to blend language models with deterministic processes, to measure results correctly, and to implement methods for agent evolution and learning.

In the industry's first experiments with the technology, the modus operandi was to treat this software as a black box, expecting results to be generated magically, with quality and accuracy. Language models were thus given total autonomy. At first glance, solutions of this kind are tempting because they are easy and fast to implement. However, this approach underestimates something intrinsic to the nature of these models: hallucinations.

Language models are, at bottom, probabilistic models, with rates of success and failure. The current challenge for software engineers is therefore to find methods that capture the technology's benefits while circumventing its limitations. This matters even more in real scenarios with hundreds of variables, sensitive data, and users with different profiles, where mistakes can be costly for the business.

What research reveals about AI agents

To understand what separates successful projects from failing ones, a study conducted by researchers from Berkeley, Stanford, and IBM offers one of the most concrete answers so far.

Measuring Agents in Production (MAP) is a systematic survey of AI agents deployed in production. Unlike previous studies such as MIT NANDA's (discussed in an article by Danilo Toledo, co-founder of Taqtile), MAP stands out not only for the volume of cases included but also for restricting itself to projects that actually went into production, and for interviewing engineers rather than executives.

Based on a questionnaire of four questions, answered by 306 engineers who build and operate agents in their daily work, spread across 26 market niches, the study gathered insights into which practices are most effective for developing agent-based systems. The questionnaire asked:

  • What are the applications, users, and requirements of the agents?

  • What models, architectures, and techniques are used to build agents that go into production?

  • How are the agents evaluated before deployment (going into production)?

  • What are the main challenges in building agents deployed in production?

Key findings

MAP's most significant finding is that successful agents in production do not have high autonomy. They are purposefully conservative, relying on human validation during execution (human-in-the-loop)* to ensure the reliability of results and of learning.

Although 52% of projects use the LLM-as-a-judge technique*, 74% primarily depend on human evaluation. Among the systems studied, 68% execute at most ten steps before requiring human intervention, with half executing fewer than five.

Graphs created by Taqtile. Source: Measuring Agents in Production (Dec 2025)
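The pattern these numbers describe, an agent that runs a bounded number of steps and then hands the case to a person, can be sketched roughly as below. The function names and the step threshold are illustrative, not taken from the study:

```python
# Sketch of a step-capped agent loop with human-in-the-loop escalation.
# All names and the threshold are illustrative assumptions.

MAX_STEPS = 5  # half of the surveyed systems executed fewer than five steps

def run_agent(task, step_fn, human_review):
    """Run at most MAX_STEPS agent steps, then escalate to a human reviewer."""
    state = {"task": task, "done": False, "result": None}
    for _ in range(MAX_STEPS):
        state = step_fn(state)  # one non-deterministic, LLM-driven step
        if state["done"]:
            # Even a "finished" result passes through human validation.
            return human_review(state["result"], escalated=False)
    # Step budget exhausted: hand the whole case to a human operator.
    return human_review(state["result"], escalated=True)
```

The point of the sketch is the control structure: the loop cannot run unbounded, and every path out of it ends at a human.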

The study investigated how teams measure the value of delivered products. Most opted for experiments within their own ecosystem, conducting A/B tests, manual feedback, and business-specific metrics such as time per task, rework rate, human escalation rate, cost per resolved case, and impact on operational indicators (SLA, conversion, avoided losses).

In other words, they did not rely on ready-made market benchmarks. This reinforces that, in the current scenario, "right" and "wrong" are operational and business definitions, not academic ones.
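Metrics like the ones cited above are typically computed from a team's own operational logs rather than from public benchmarks. A minimal sketch, where the record fields (resolved, escalated, cost, seconds) are assumptions about what such a log might contain:

```python
# Sketch: computing business-specific metrics from a log of handled cases.
# The field names are illustrative, not a real schema.

def operational_metrics(cases):
    total = len(cases)
    resolved = sum(1 for c in cases if c["resolved"])
    escalated = sum(1 for c in cases if c["escalated"])
    return {
        "human_escalation_rate": escalated / total,
        "cost_per_resolved_case": sum(c["cost"] for c in cases) / max(resolved, 1),
        "avg_time_per_task_s": sum(c["seconds"] for c in cases) / total,
    }
```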

These findings align with what MIT NANDA (State of AI in Business 2025) had already shown: most projects fail because they do not learn from feedback, are not properly measured, do not fit into the actual business flow, and are unreliable because sensitive processes run in a non-deterministic manner.

Another noteworthy result is that 70% of projects use off-the-shelf models without fine-tuning*, relying only on prompt engineering*. The study also found low adoption of ready-made agent frameworks: 85% of cases built custom applications to gain control and predictability. On the trade-off between output quality and response time, 66% of participants prioritized quality, even when that meant a response time of several minutes.
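In practice, "prompt engineering without fine-tuning" usually means encoding domain rules, allowed outputs, and an escape hatch directly into the prompt, so an off-the-shelf model can be used as-is. A minimal, hypothetical template:

```python
# Hypothetical prompt template: domain rules and output constraints live in
# the prompt itself, so the base model needs no fine-tuning.

PROMPT_TEMPLATE = """You are a support triage assistant.
Classify the ticket into exactly one of: {categories}.
Answer with the category name only; if unsure, answer "ESCALATE".

Ticket:
{ticket}
"""

def build_prompt(ticket, categories):
    return PROMPT_TEMPLATE.format(
        categories=", ".join(categories),
        ticket=ticket,
    )
```

Note the explicit "ESCALATE" instruction: constraining the output and giving the model a safe fallback is what makes the human-in-the-loop handoff workable.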

Conditions for success in process automation with AI

The research points out the conditions under which agents have the greatest potential to generate results in production:

  1. There are repetitive tasks with high human cost (73% of respondents built agents to automate manual tasks and increase efficiency).

  2. There is a verifiable "operational truth," even if partial.

  3. The process accepts human-in-the-loop as a control step.

The study also provided information about the market niches in which the systems were deployed and examples of applications. We included the graph and table below for better consolidation and visualization of information:

Graph and table created by Taqtile. Source: Measuring Agents in Production (Dec 2025)

The field is still consolidating, but the MAP data shows that the direction is clearer than it seems: less autonomy, more control; less benchmark, more self-measurement; less ready-made framework, more custom engineering. Companies that understand this early have a real advantage, not over technology, but over how to use it.

This article is part of an initiative by Taqtile to share what we are learning as we develop software with AI agents. This is a topic we closely follow here. On our LinkedIn page, you can find more about this subject from the perspective of those who are building real use cases of AI in companies.

For companies that want to start this journey methodically, the AI Sprint is our design sprint methodology adapted for the era of generative AI. And if you want to explore how these learnings can apply to your context, talk to our team.

Footnotes:

LLM-as-a-judge: using an LLM to evaluate a system's responses/outputs.

Human-in-the-loop: a flow in which humans validate or take over critical steps (e.g., final approval, exceptions), reducing risk and increasing reliability in production.

Prompt engineering: constructing and iterating on prompt instructions and context to guide the model.

Fine-tuning: adjusting a model's weights with proprietary data.

References

  1. Pan, Melissa Z.; Arabzadeh, Negar; Cogo, Riccardo; et al. Measuring Agents in Production (Dec 2025)

  2. The GenAI Divide (MIT NANDA), State of AI in Business 2025 report

  3. Alvarez-Teleña, S.; Díez-Fernández, M. Advances in Agentic AI: Back to the Future (Dec 2025)