Synthetic expert panels
Delphi method, reimagined for the agent economy
Convene diverse expert perspectives to reason toward convergence in minutes, have real people adjudicate the result, and never trust the panel as a solitary oracle.
The classical method
The Delphi method, developed at RAND in the 1950s, structures expert judgement: a panel answers a question independently, sees the anonymised range of answers and reasoning, then revises, iterating until the group converges or the genuine disagreements become clear. Anonymity blunts the loudest-voice problem and seniority bias, and the process produces a defensible collective view on questions that have no data yet.
Its power is the same as the power behind every forecasting tournament: the wisdom of a diverse, aggregated crowd reliably beats most of the individuals inside it. Delphi is simply a disciplined way to manufacture that crowd and let it argue anonymously toward a position that no single expert would have reached alone.
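One way to make the crowd effect precise, as a sketch under two assumptions that the text above does not fix: forecasts are probabilities scored by squared error (the Brier score), and the crowd is aggregated by a simple mean. For N forecasts p_i of a binary outcome o in {0, 1}, with mean forecast p̄:

$$
\frac{1}{N}\sum_{i=1}^{N}(p_i - o)^2 \;=\; (\bar{p} - o)^2 \;+\; \frac{1}{N}\sum_{i=1}^{N}(p_i - \bar{p})^2
$$

The left side is the average individual error, the first term on the right is the crowd's error, and the last term is the variance of the forecasts, which is never negative. Under these assumptions the mean forecast is never worse than the average panelist, and the gap is exactly the panel's diversity. The identity does not guarantee that the crowd beats every individual, and it offers no protection when all panelists share the same bias.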
Why the human-run version hit a ceiling
Convening a real Delphi panel is slow and expensive: recruiting genuinely diverse experts, running multiple rounds, chasing responses over weeks or months. The cost forces small panels and few iterations, which undercuts the very diversity that makes the method work and means the technique is reserved for rare, high-stakes questions. By the time a panel converges, the question has often moved on.
How it works with agents
Agents let you run a Delphi panel that is broader, faster and more diverse than any human convening. Persona-agents grounded in different disciplines, regions and schools of thought answer the question independently, critique each other's reasoning anonymously, and iterate toward convergence — or toward a clear, well-mapped disagreement — in minutes rather than months. The breadth that was previously unaffordable becomes the default.
Diversity is the active ingredient, and it is engineered rather than hoped for: the panel runs across multiple underlying models and deliberately divergent personas, because an aggregate of agents that all share an architecture would simply share its blind spots. The spread of opinion across the panel is itself an output — a wide spread flags genuine uncertainty that a single confident answer would have hidden.
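Treating the spread as an output can be sketched as follows. This is a minimal illustration, not the panel's actual implementation: the function name `summarise_panel`, the persona labels, the use of the median and interquartile range, and the `wide_iqr` cut-off of 0.25 are all assumptions chosen for the example.

```python
from statistics import median, quantiles


def summarise_panel(forecasts, wide_iqr=0.25):
    """Summarise a panel's probabilities for one binary question.

    forecasts: dict mapping a persona label to a probability in [0, 1].
    wide_iqr:  hypothetical cut-off above which the spread is flagged.
    """
    values = sorted(forecasts.values())
    if len(values) < 2:
        raise ValueError("need at least two panelists to measure spread")
    # Quartiles of the panel's answers; the interquartile range is the spread.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median(values),      # the panel's central answer
        "iqr": iqr,                    # reported alongside it, not hidden
        "wide_spread": iqr > wide_iqr, # flags genuine uncertainty
    }


# Illustrative numbers only: five personas spread across two made-up models.
divided_panel = {
    "economist/model-a": 0.10,
    "lawyer/model-b": 0.20,
    "regulator/model-a": 0.50,
    "engineer/model-b": 0.80,
    "historian/model-a": 0.90,
}
divided = summarise_panel(divided_panel)

aligned_panel = {"economist/model-a": 0.50, "lawyer/model-b": 0.52, "engineer/model-b": 0.55}
aligned = summarise_panel(aligned_panel)
```

Both panels have a central answer near 0.5, but only the first is flagged: the median alone would have hidden how divided it is.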
This is explicitly not a replacement for human experts. The synthetic panel does the breadth and the iteration at machine scale; real domain experts then adjudicate the output — checking it, correcting it, and owning the final judgement. The configuration that works is the dyad, not the oracle.
What the evidence says
This archetype is unusually well-evidenced, because it sits on top of the LLM-forecasting literature. Halawi et al.'s "Approaching Human-Level Forecasting with Language Models" builds a retrieval-augmented system whose accuracy on real prediction-market questions approaches the aggregated human crowd, the gold standard in judgemental forecasting. Schoenegger et al.'s "Wisdom of the Silicon Crowd" arrives at a comparable result by a more robust route: an ensemble of twelve diverse LLMs, aggregated the way you would aggregate a human tournament, rivals the human crowd. Silicon reproduces the wisdom-of-crowds effect itself.
The decisive piece for an operating model that keeps humans in charge is Schoenegger et al.'s randomised controlled trial "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy", which found that pairing human forecasters with an LLM assistant lifted their accuracy by roughly 24–28% relative to a control group, even when the assistant was deliberately handicapped. That is the load-bearing result: not that machines should forecast instead of people, but that people forecast better with a model alongside them than without one. Park et al.'s "Generative Agents" supplies the other half: evidence that LLM agents can sustain distinct, believable personas, which is the precondition for a genuinely diverse synthetic panel.
Applied to the agent economy
For the agent economy, the questions that matter most are precisely the ones with little hard evidence yet. How fast will autonomous agents take on real economic decisions? Where will liability land when an agent transacts and something goes wrong? Which institutions adapt and which break? A synthetic panel lets us map the full spread of informed opinion in an afternoon, then bring the sharpest human experts in to settle it.
It also lets us ask these questions far more often. When convening a panel cost months, a unit could afford a handful of Delphi exercises a year; when it costs minutes, every consequential and genuinely uncertain question can get one, and the human experts' time goes to adjudication, the part only they can do, rather than to logistics.
Where humans stay in command
The failures here are overconfidence and miscalibration — a panel that is fluent, well-reasoned and wrong, with stated confidence that bears little relation to its hit rate. The International AI Safety Report 2026 catalogues exactly these failure modes for frontier models. So forecasting personas are kept on a standing calibration scoreboard, scored against questions as they resolve; ensemble agreement is required before any answer is surfaced as high-confidence; and long-horizon, low-base-rate, structurally novel questions — where the prediction-market validation is weakest — are routed to mandatory human-and-red-team review rather than treated as decision-ready.
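The three safeguards above can be sketched in a few lines. This is an illustration under stated assumptions: the Brier score as the calibration measure, population standard deviation as the agreement measure, the `max_disagreement` threshold of 0.10, and the names `CalibrationScoreboard` and `route` with their return labels are all hypothetical choices, not details given in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev


class CalibrationScoreboard:
    """Running Brier score per persona (lower is better).

    Always answering 0.5 scores 0.25, which makes a useful reference point.
    """

    def __init__(self):
        self._errors = defaultdict(list)

    def record(self, persona, probability, outcome):
        """Store the squared error once a question resolves (outcome is 0 or 1)."""
        self._errors[persona].append((probability - outcome) ** 2)

    def brier(self, persona):
        """Mean squared error over resolved questions, or None if none resolved."""
        errors = self._errors.get(persona)
        return mean(errors) if errors else None


def route(forecasts, *, long_horizon=False, low_base_rate=False, novel=False,
          max_disagreement=0.10):
    """Decide how a panel answer is surfaced; every route still ends with a human."""
    # Question types where validation is weakest skip the confidence check entirely.
    if long_horizon or low_base_rate or novel:
        return "human_and_red_team_review"
    # High confidence requires the ensemble to agree, not just to sound sure.
    if pstdev(forecasts) <= max_disagreement:
        return "high_confidence"
    return "report_with_spread"


board = CalibrationScoreboard()
board.record("economist/model-a", 0.9, 1)  # confident and right
board.record("economist/model-a", 0.6, 0)  # leaning the wrong way
score = board.brier("economist/model-a")

agreed = route([0.80, 0.82, 0.79])
split = route([0.2, 0.5, 0.8])
flagged = route([0.8, 0.8], novel=True)
```

The ordering in `route` is the design choice worth noting: a structurally novel question goes to review even when the panel agrees perfectly, because agreement among agents is weakest as evidence exactly where validation is thinnest.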
And the human frames the question and owns the verdict. A badly posed question yields a confidently precise but meaningless number, so question design stays a human judgement task, and the analyst's published position is informed by the panel but is the analyst's own. The divergence between human and panel is itself recorded as a calibration signal over time: the institution learns where its agents and its people tend to disagree, and who tends to be right.
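A divergence log of this kind could be as small as one record per question. The sketch below is an assumption about shape only: the `DivergenceRecord` name, its fields, and the use of absolute error to decide who was closer are illustrative choices, not a description of an existing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DivergenceRecord:
    """One question, the panel's answer, and the analyst's published answer."""
    question: str
    panel_probability: float
    analyst_probability: float
    outcome: Optional[int] = None  # set to 0 or 1 once the question resolves

    @property
    def divergence(self):
        """How far the analyst's position sits from the panel's."""
        return abs(self.analyst_probability - self.panel_probability)

    def closer(self):
        """Return 'analyst', 'panel' or 'tie'; None while unresolved."""
        if self.outcome is None:
            return None
        analyst_error = abs(self.analyst_probability - self.outcome)
        panel_error = abs(self.panel_probability - self.outcome)
        if analyst_error < panel_error:
            return "analyst"
        if panel_error < analyst_error:
            return "panel"
        return "tie"


# Illustrative values only.
record = DivergenceRecord(
    question="Will X happen by the stated date?",
    panel_probability=0.7,
    analyst_probability=0.4,
)
before = record.closer()
record.outcome = 0  # the event did not happen
after = record.closer()
```

Aggregating `closer()` by topic or question type over many resolved records is what would turn individual disagreements into the institutional signal described above.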
How we run it
- Convene — assemble diverse persona-agents across disciplines, regions and viewpoints, deliberately spanning multiple underlying models.
- Poll — each answers the question independently, with its reasoning and an explicit confidence band.
- Iterate — agents see the anonymised spread and revise toward convergence, or toward a clearly-mapped disagreement.
- Score — the spread is reported as signal, and each persona's calibration is tracked against questions as they resolve.
- Adjudicate — human experts review, correct and sign off; the human–panel divergence is logged for next time.
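The convene, poll and iterate steps above can be sketched as a short loop. This is a toy under stated assumptions: real panelists would be calls to persona-agents on different models, whereas here `make_stub` builds a deterministic stand-in; the names `run_delphi` and `make_stub`, the range-based convergence test, and the `converged_range` of 0.10 are all hypothetical.

```python
from statistics import median
from typing import Callable, Optional, Sequence

# A panelist takes the question and the anonymised answers from the previous
# round (None in round one) and returns a probability in [0, 1].
Panelist = Callable[[str, Optional[Sequence[float]]], float]


def run_delphi(question, panelists, max_rounds=3, converged_range=0.10):
    """Poll, share the anonymised spread, and repeat until the range is narrow
    or the rounds run out."""
    shared = None
    history = []
    for _ in range(max_rounds):
        answers = [panelist(question, shared) for panelist in panelists]
        history.append(answers)
        if max(answers) - min(answers) <= converged_range:
            break
        shared = sorted(answers)  # sorted so position does not reveal the author
    final = history[-1]
    final_range = max(final) - min(final)
    return {
        "median": median(final),
        "range": final_range,
        "rounds": len(history),
        "converged": final_range <= converged_range,
        "history": history,
        "status": "awaiting_human_adjudication",  # the loop never signs off itself
    }


def make_stub(prior, pull):
    """Toy panelist: answers `prior` in round one, then moves a fraction `pull`
    of the way from `prior` toward the median of the shared answers."""
    def panelist(question, shared):
        if shared is None:
            return prior
        return prior + pull * (median(shared) - prior)
    return panelist


question = "Illustrative question"
open_minded = [make_stub(0.2, 0.9), make_stub(0.5, 0.9), make_stub(0.8, 0.9)]
stubborn = [make_stub(0.2, 0.0), make_stub(0.5, 0.0), make_stub(0.8, 0.0)]

converging = run_delphi(question, open_minded)
disagreeing = run_delphi(question, stubborn)
```

The open-minded panel converges in the second round, while the stubborn one exhausts its rounds with the range unchanged. Both results are returned in the same shape, so a mapped disagreement reaches the human adjudicators as a finding and not as a failure.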