Accessibility QA is slow and expensive. We tried to change that with AI agents

By Julia Ribeiro, Leticia Fonseca, and N. Naomi Sato

Digital accessibility no longer suffers from a lack of diagnosis. It suffers from a lack of scale in remediation.

The data makes this clear. 94.8% of the world's top 1 million most-visited homepages had detectable accessibility failures, according to The WebAIM Million report published in February 2025. In Brazil, only 2.9% of sites passed all basic tests evaluated in 2024, according to research by BigDataCorp in partnership with Movimento Web para Todos (Web for All Movement). Meanwhile, inside product and technology teams, the day-to-day reality is a growing backlog of already-known issues whose remediation moves slowly or simply stays on paper.

Based on what we've tracked internally over the past year, some of the improvements identified in projects remained at 0% resolution after QA; others progressed only to 6% or 43%; few reached 90% or 94%. Not for lack of technical evidence, but because accessibility keeps losing ground to competing urgencies, capacity constraints, team changes, and migration or restructuring contexts. QA as a diagnostic can be fast; the hard part is making remediation actually happen.

It was in the face of this gap between identifying and resolving that we began to investigate a simple hypothesis: can AI help make accessibility QA and its long-term maintenance more viable, faster, and scalable?

What the backlog reveals — market, society, and our work

Brazil has had legislation since 2015. The Brazilian Inclusion Law (Lei Brasileira de Inclusão) requires companies headquartered in the country to ensure accessibility on their websites and apps. But while the law exists, enforcement hasn't kept up. The market doesn't demand it. Society still doesn't push hard enough.

Meanwhile, the numbers keep growing. The 2022 Census counted more than 32 million people aged 60 or older in Brazil (IBGE, Demographic Census 2022), a 56% increase compared to 2010. Among people aged 70 and over, 27.5% have some type of disability (IBGE, Demographic Census 2022). Population aging isn't a distant trend: it's a reality that today's digital products already need to serve.

And accessibility goes beyond permanent disabilities. An accessible product works better for everyone: someone with an injured arm, someone holding a baby while trying to call a rideshare, someone using their phone with one hand. Inclusion and good experience go hand in hand.

The dynamics inside teams reinforce the problem. When we raise the accessibility topic with clients, the response is rarely "we don't want to." It's "not right now." Without a clear way to justify ROI to budget approvers, the topic loses to other priorities. The backlog grows. The cycle repeats.

It was thinking about how to break that cycle that we started asking: what if AI changed the equation?

Can accessibility be scaled with agentic AI?

“We hope so.” That was the bet we started the experiment with.

To understand where process automation could fit in, we mapped the four stages of the existing workflow:

  1. Accessibility QA: app inspection to find issues, expert-dependent and hard to scale

  2. Issue documentation: each problem recorded with description, evidence, and remediation guidance

  3. Development: implementing the fix based on the documentation and local testing

  4. Code review: review by other developers before merging into the main branch

The decision to focus first on touch target area, rather than tackling all five accessibility criteria at once (touch target, contrast, reduced motion, text enlargement, and screen readers), came from conversations with the team. It made sense to start with the criterion that seemed simplest, with clear mathematical rules, and build from there.

We brought AI into the QA, development, and code review stages, creating AI tools (skills) for each of them in Claude Code:

  • QA: the “Touch Scanner” (identifies touchable components in the project) and the “Touch Auditor” (evaluates whether they meet the minimum touch target area criterion).

  • Development: the “Touch Corrector,” to apply the necessary fixes.

  • Code review: a check that the applied changes actually meet the minimum touch target area criterion.
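
To make the handoff between these stages concrete, here is an illustrative sketch of the data each skill produces and consumes. The TypeScript type names and fields are ours for this article, not the actual skill contracts:

    // Illustrative data flow between the three skills; names and
    // fields are hypothetical, not the real skill contracts.

    // Touch Scanner output: where a touchable component lives.
    interface TouchableFinding {
      file: string;       // e.g. "src/screens/Home.tsx"
      component: string;  // e.g. "Pressable"
      line: number;
    }

    // Touch Auditor output: measured size plus a verdict.
    interface AuditResult extends TouchableFinding {
      width: number;      // resolved size in dp/pt
      height: number;
      passes: boolean;    // meets the minimum touch target area?
      reason?: string;    // context for the corrector, e.g. "child icon already guarantees the minimum"
    }

    // The Touch Corrector consumes only the failures.
    type CorrectionQueue = AuditResult[];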

Smart division: what stays with AI and what stays with people

Our starting point was much simpler than the final result. We gave Claude Code some basic information and asked it to evaluate and fix everything at once. As expected, it didn’t work well. Files with violations were ignored, others were fixed incorrectly, and the worst problem was inconsistency: every time we ran the command, the result was different.

The first improvement was to break the work into stages. That makes sense: when we solve a complex problem manually, we also break it into parts. But the inconsistency didn’t disappear, and we didn’t even know exactly how to measure the improvement.

The most important breakthrough came from a conversation with Nix from the AI team: the concept of determinism. Deterministic instructions always produce the same result for the same input. Generative AI is inherently non-deterministic, but some parts of our process could be.

Finding touchable components, for example, can be done with a precise terminal command, the same way a developer would do it manually to avoid reading every file one by one. We turned that step into exactly that kind of command, taking as parameters the file patterns and the code patterns that mark a component as touchable. The result became consistent and fast.
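
As a minimal sketch of what that deterministic step can look like (the file glob and component names below are illustrative assumptions, not our exact parameters):

    // scan-touchables.ts: a deterministic scanner sketch. Same input,
    // same output, every run. Patterns are illustrative parameters.
    import { readFileSync, readdirSync } from "node:fs";
    import { join } from "node:path";

    const FILE_PATTERN = /\.tsx$/;
    const TOUCHABLE = /<(Pressable|TouchableOpacity|TouchableHighlight)\b/;

    function scan(dir: string): string[] {
      const hits: string[] = [];
      for (const entry of readdirSync(dir, { recursive: true, encoding: "utf8" })) {
        const path = join(dir, entry);
        if (!FILE_PATTERN.test(path)) continue; // skips directories and non-component files
        readFileSync(path, "utf8").split("\n").forEach((text, i) => {
          if (TOUCHABLE.test(text)) hits.push(`${path}:${i + 1}`);
        });
      }
      return hits;
    }

    console.log(scan("src").join("\n"));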

The evaluation step couldn’t be deterministic. A component might have children that already guarantee the minimum area, or be placed in a context where changing its size would break the layout. That kind of contextual judgment doesn’t fit in a terminal command. What we did was document everything we could think of that might affect a component’s height and width, and pass that as instructions to the evaluation skill.
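
To give a sense of what those instructions cover, here is a hypothetical summary of the factors the auditor weighs; the field names and the simplified formula are ours for illustration:

    // Hypothetical checklist of what can affect a component's
    // effective touch area before a violation is declared.
    interface TouchContext {
      explicitSize?: { width: number; height: number }; // size props or styles
      verticalPadding: number;         // padding can enlarge the effective target
      hitSlop: number;                 // extends the touch area beyond the layout box
      childGuaranteesArea: boolean;    // e.g. a large icon inside a small wrapper
      resizeWouldBreakLayout: boolean; // dense rows, fixed grids, etc.
    }

    // One simplified combination, assuming the measured height is the
    // content box, so padding and hit slop both extend the target.
    const effectiveHeight = (c: TouchContext, contentHeight: number): number =>
      contentHeight + 2 * c.verticalPadding + 2 * c.hitSlop;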

Our projects use design tokens, a naming system that standardizes visual values like spacing, sizes, and colors into reusable variables. In code, this appears as height: theme.spacing.small, for instance, instead of the direct numeric value.

To make the AI’s work easier in this context, we created a reference file with the resolved values for each token, along with exceptions and the project’s basic components. So when it encounters theme.spacing.small, it already knows that means 12, without having to trace the origin through the code. This made the process faster and reduced the chances of hallucination.
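
A minimal sketch of such a reference file: apart from theme.spacing.small resolving to 12, the tokens and values below are invented for illustration:

    // tokens.resolved.ts: pre-resolved token values the skills read
    // directly instead of tracing the theme through the codebase.
    // Only "theme.spacing.small": 12 comes from our projects; the
    // rest are hypothetical examples.
    export const resolvedTokens: Record<string, number> = {
      "theme.spacing.small": 12,
      "theme.spacing.medium": 16,   // hypothetical
      "theme.sizes.iconButton": 48, // hypothetical
    };

    // With this map, height: theme.spacing.small resolves to 12
    // without the model reading any theme files.
    export const resolveToken = (token: string): number | undefined =>
      resolvedTokens[token];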

Building an AI capable of reliably fixing accessibility issues was where human judgment tipped the scale. It’s not enough to instruct; you have to specify how, and in what order. For each identified problem there are multiple possible solutions: increasing the component’s touch target area, adjusting the spacing around it, reorganizing the layout. Which one comes first?

That prioritization isn’t obvious, and an AI without context tends to choose the most interventionist solution, which can break the design or create new problems. The priority list we built reflects what we know about the company’s patterns, our template, and accessibility best practices. It’s that prior knowledge, codified into explicit instructions, that makes the difference between a usable result and one that needs to be redone.
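
The real list encodes company-specific patterns we won’t reproduce here, but its shape is an explicit ordering from least to most invasive, something like this sketch:

    // Illustrative ordering only; the real list reflects the company's
    // patterns, our template, and accessibility best practices.
    const fixPriority = [
      "extend the touch area without moving pixels (e.g. hit slop)",
      "increase internal padding (small, contained layout impact)",
      "increase explicit width/height (visible change, may shift siblings)",
      "reorganize the surrounding layout (last resort, highest risk)",
    ] as const;

    // The corrector tries each strategy in order and stops at the
    // first one that reaches the minimum area without breaking layout.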

Throughout the process, we refined far more than just priorities: we tested with and without skills, with AI agents as orchestrators, with different models, and with intermediate reports saved to file. In every test the question was the same: which matters most, execution time, cost, or output quality? Anthropic’s Skill Creator (a tool for creating and evaluating AI skills) was very helpful in these comparisons, with an evals function (automated quality assessments) that lets you compare different versions more objectively.
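
We won’t reproduce Skill Creator’s eval format here, but the idea behind any eval is the same: a fixed input, a known expected outcome, and a score computed identically for every skill version. A generic sketch:

    // Generic eval sketch (not Skill Creator's actual format): fixed
    // fixtures with known violation counts, scored the same way for
    // every version of the skill under test.
    interface EvalCase {
      fixture: string;            // snapshot of code with known violations
      expectedViolations: number;
    }

    // Fraction of fixtures where the skill found exactly what we planted.
    function score(found: number[], cases: EvalCase[]): number {
      const correct = cases.filter((c, i) => found[i] === c.expectedViolations);
      return correct.length / cases.length;
    }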

Our solution is far from perfect. Even with increasingly specific instructions, the AI still forgets a detail, fails to read a file, or makes something up that it shouldn’t. Once the fixes are in, we still need a human to test the running app and make sure nothing is off.

But it is a concrete starting point. Just like developers use AI to speed up building screens and components and then review the result, it’s possible to use AI to ensure the app is accessible and then check whether it actually resolved the issues correctly. The process shouldn’t be any different.

Touch target was the pilot. What comes next.

Many companies celebrate the AI pilot as if it were the innovation itself. The chatbot worked. The demo impressed. Done.

But the pilot is proof that the problem is worth solving that way, not that it’s been solved. The real innovation begins afterward, when the experiment becomes practice and practice becomes part of the process.

By that same logic, what we found is that it’s worth betting on AI to reduce the cost and increase the scale of accessibility QA in digital products. But this is ongoing work. Success is measured by the impact on the people who use the product, not by meeting a deadline or delivering a scope.

Project asks: “Did we ship?”

Product asks: “Is it working for whom, to what extent, and what do we need to adjust?”

The experiment showed the solution is viable. Running consistently, we achieved 53% component coverage; in our best attempt, we reached 84.6% coverage with correct identification and remediation. Managing this as a product is what will keep that improvement going.

If you’re facing the same challenge in your products, reach out to Taqtile: we want to understand your context and help.

On our evolution roadmap: adapting the solution to the next criteria from WCAG and Brazilian standard NBR 17060, covering contrast, reduced motion, text enlargement, and screen readers.

If you’d like to pilot this with us, have suggestions, or want to contribute to scaling accessibility improvements in apps with AI: reach out; we’d love to collaborate!