Synthetic Personas in Multi-agent Architecture: what changes when you stop trusting 'two smart AI agents'

By Ysabella Andrade

In my previous article, I described how I built my first synthetic persona pipeline with AI agents (a Researcher agent consolidating data, and a Senior UX agent simulating the interview).

It worked. It helped refine the prototype before going to team validation and subsequent testing with real users, and it taught me more about the client's business.

But at the end of that article, I left a door open:

This opened doors to exploring tests in a multi-agent structure suited for AI agents for business: splitting test objectives across more than one agent

This article is about what happened when I actually opened that door, and discovered it wasn't just about adding more agents. I had to rethink the entire architecture.

1. Why two agents started to fall short

The two-agent pipeline solved a specific problem: generate a persona from data and simulate a conversation with it. For a first experiment, that was enough.

The problem appeared when I tried to use the same method in different scenarios:

What if I had no interviews? Only team hypotheses, secondary data, a few loose benchmarks. The persona would "come out," but with the same visual confidence as a persona built on real data. Dangerous.

And when I wanted to test a concept not yet designed? The Senior UX agent kept simulating "screen interactions" because that's what it knew how to do. The response didn't always make sense. And how could I tell if an answer came from the data or was inferred? I often couldn't. Sometimes the agent mixed both in a single sentence, and I had to manually go back to the data to check.

The first article in this series (Accelerating and refining UX testing with AI and multi-agent systems) had already flagged this in one of its closing points: without a rigorous structure, AI tends to fill gaps with generic assumptions. I saw AI fill a gap with an assumption as if it were evidence, more than once, and realized that more sophisticated prompts weren't going to fix it.

What needed to change wasn't the prompt, but the architecture. If you've been there, you know the feeling: the question shifts from "how do I prompt better?" to "how do I organize the entire system?"

Image caption (pt-br): Overview of the current architecture for synthetic user testing, currently available as a plugin.

2. The turning point: thinking in layers, not agents

The first thing I had to internalize is that "multi-agent" doesn't mean "many agents talking to each other." Quite the opposite.

The initial instinct is to have agents help each other: the UI Critic reviews what the Empathy agent wrote, the Modeler talks to everyone, and so on. It feels collaborative. In practice, it contaminates everything.

When two AI agents can read each other's output, one's inferences become "facts" for the other. The pattern is always the same: the first one ventures a hypothesis ("the user seems anxious"); the second reads it and treats it as an observation ("the user is anxious"); the third proposes actions based on anxiety. By the end of the cycle, the hallucination is woven into the persona as if it were evidence.

The solution was to invert the logic. Instead of agents that communicate, layers that isolate. Each layer has a unique role, receives input from a specific layer, and no one feeds back into anyone else. The flow is unidirectional.

The result was a six-layer architecture:

  • Layer 1: Pre-Processor — structures raw data

  • Layer 2: Analytical skills — analyze in parallel, orchestrated by a Supervisor

  • Layer 3: Modeler — consolidates into a persona

  • Layer 4: Senior UX — simulates the interview

  • Layer 5: Synthesis — integrates and evaluates

  • Layer 6: Researcher — audits traceability

The "Researcher" changed roles between the two articles (revisit the old structure here). Before, it created personas; now it audits what was created. That was one of the best decisions in the project: separating who builds from who checks.

3. The three rules that sustain the architecture

Having six layers means nothing on its own. What makes the whole thing work are the three rules that govern the flow between them.

3.1 Only one layer touches raw data

The Pre-Processor is the only agent with direct access to the input folders (interviews, analytics, videos, screens, hypotheses, etc.). All subsequent layers operate exclusively on its structured output.

Why does this matter? Because raw data is too open to interpretation. When multiple agents read the original transcript, each one reads the same sentence differently. When only one agent reads and structures it, everyone else starts from the same base, and any divergence downstream becomes a signal of something real, not a difference in reading.

3.2 Analytical layer skills don't read each other

Layer 2 has several skills running in parallel: Timekeeper (measures time and friction), Empathy (maps observably expressed emotions), Quantitative (analytics), UI Critic, Hypotheses, Benchmarks, and so on. Which ones are active depends on the operating mode.

In a usability test with video, for example, Timekeeper and UI Critic are active; in a proto-persona without real data, Hypotheses and Benchmarks come in.

All of them receive the same input: the Pre-Processor's output. None of them reads what the others wrote. The layer orchestrator (a Supervisor that coordinates the skills without reading each one's output) delivers the full set to the Modeler in the next layer.

This isolation was the hardest thing for me to accept at first. It felt wasteful. Now I see it as the opposite: it's what allows Layer 5 (Synthesis) to detect genuine convergence between independent analyses. If three isolated skills reached the same conclusion by different paths, that's worth far more than if they had influenced each other.

3.3 Every claim needs a source

This is the most annoying rule to implement and the one that most protects against hallucination. It's a basic principle of responsible AI. Every item produced by any skill must carry a classification:

  • Explicit data: directly observed;

  • Inferred: logical connection not directly observed;

  • Secondary data: external research, report;

  • Benchmark: similar product, published case;

  • Team hypothesis: team assumption with no external data;

  • Absence: no data found, do not fill in.

It sounds overly formal, right? But when the Researcher audits at the end, it can precisely separate what is grounded in evidence from what is inference. And when I, as a designer, present the persona to the team, I know exactly where I can make a confident claim and where I need to say "this is a hypothesis — it needs validation."

Image caption (pt-br): The Researcher agent flagging inferences drawn from the consulted sources, in the context of a proto-persona study focused on apartment purchasing.

4. Checkpoints: where the system can stop

One thing I learned: trusting that AI will "deliver well" is the most expensive way to discover that it didn't. That's why the architecture has two critical checkpoints mid-cycle, points where the cycle stalls if any criterion isn't met.

  • Checkpoint 1: after the Pre-Processor. Before triggering the parallel analyses, the system checks: were all folders processed? Does each item have a source classification? Does each item have an origin reference (folder, file, excerpt)? Were transformations declared? If running in a real-data mode, is there an absence of personally identifiable information?

If any of these fails, the cycle stops. It does not advance to layer 2. It flags what's missing.

  • Checkpoint 2: after the Modeler. Before moving to simulation, it checks whether the persona has all required sections, whether each characteristic has classified support, whether a decision pattern is mapped. If it's a proto-persona, it checks whether the language is conditional, whether confidence is declared as low, whether it's identified as a proto-persona in the header.

At the end, after Synthesis, comes the Researcher's audit, which can approve, approve with reservations, or reject. Rejected? It goes back for revision.

These checkpoints make the cycle take longer. In return, I arrive at the end with an artifact I can trust. And when I can't, the system itself tells me where to look.

5. Persona or proto-persona: the case when data doesn't exist

Perhaps the part that most reorganized my thinking was separating persona from proto-persona as distinct artifacts, not as "the same thing at different stages."

A synthetic persona is born when there's real data — interviews, analytics, videos. Medium to high confidence. Descriptive language.

A proto-persona is born when there isn't — only hypotheses, secondary data, benchmarks. Low confidence. Conditional language required.

The difference in practice:

  • Persona (with real data):

    • "The user paused for 8 seconds before clicking";

    • "3 out of 5 users abandoned the flow at this point";

    • "There's a long pause followed by reformulation."

  • Proto-persona (without real data):

    • "The audience would possibly face difficulty";

    • "If the hypothesis holds, this profile would likely...";

    • "Secondary data suggests that...".

Image caption (pt-br): proto-persona presentation flagging hypothetical responses.

It seems like a linguistic detail, but it's a structural detail. When the Modeler writes "the user feels X" without real data, it turns a hypothesis into a statement. When it writes "the audience would possibly feel X," it preserves the uncertainty, and whoever reads it will make a different decision.

And one rule that was hard to accept but makes sense: proto-personas don't evolve into personas. When real data arrives, you don't "update" — you start from scratch. Because a proto-persona is by nature hypothetical; mixing it with real data creates a hybrid where no one can distinguish what was evidence from what was assumption.

6. What changes in the designer's role (and what doesn't)

Going back to the question that opened the first article: will AI replace the research designer? Still no. But the answer has become more nuanced.

What changed with the multi-agent architecture

I spend less time "checking if AI made something up." The system segregates evidence from inference by default. I can handle scenarios where I have no real data, before I used to avoid personas in those cases. Now I generate proto-personas that are conscious of their own limitations. The artifact I deliver to the team comes with built-in auditing. Confidence is not just a feeling of mine, it's a declared field.

What didn't change

Synthetic persona is still a probabilistic structure, not a real person. Simulation does not replace empirical validation. The caveat remains mandatory. Continuous field research is what keeps data fresh. Without it, any persona, synthetic or not, becomes a time capsule.

And perhaps the most important change: the architecture forced me to think before simulating. What kind of test do I want to run? Do I have real data? What confidence level does the result need to have to support a decision? These are questions I used to skip straight to the prompt for. Now they're the first step in the cycle.

Leading with method

If the first article was about discovering that synthetic personas work, this one was about discovering that they work better when you impose structural limits on AI, not when you give it freedom.

The intuition is the opposite. We want agents that are "smart," "creative," "connecting dots." In practice, it's the isolation, the mandatory source classification, and the auditing that produce a trustworthy artifact. The creativity of generative AI, without AI governance, is exactly what undermines confidence in the result.

Here at Taqtile, that learning became a plugin: a reusable architecture that any designer on the team can run on any project — technology applied in practice, with the same evidence filters running underneath.

The idea isn't to lock the method inside one person, we want to turn the method into something the team can build and iterate on together.

What about you? Have you tried structuring your own synthetic personas in layers, or are you still in the "two smart agents" phase?