A new paradigm for medical AI: why disagreement between models may be more valuable than consensus

When an AI system in health care gives a confident answer, should clinicians trust it? In a new article in Frontiers in Artificial Intelligence, KI researcher Farhad Abtahi and colleagues argue that the question itself is wrong, and that the entire field needs to rethink how medical AI systems are designed. Here, he develops his reasoning.
Researchers at Karolinska Institutet propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual paradigm that challenges the dominant approach of building ever-larger single AI models. Instead of seeking one best answer, MEDLEY orchestrates multiple AI models in parallel and deliberately preserves their disagreements, treating diversity of outputs as a resource for clinical reasoning rather than a problem to be resolved.
We spoke with Farhad Abtahi, researcher and manager of the SMAILE (Stockholm Medical AI and Learning Environments) core facility at the Department of Clinical Science, Intervention and Technology, and corresponding author of the study.
You describe MEDLEY as a paradigm shift. What do you mean by that?
‘Today, the dominant trend in medical AI is to build a single, powerful model and treat its output as the answer. MEDLEY proposes something fundamentally different. Instead of collapsing multiple perspectives into one consensus, we preserve them. The clinician sees where models agree, where they disagree, and why, because each model’s training background and biases are documented transparently. This shifts the goal from maximizing accuracy through uniformity to enhancing insight through structured diversity. It is a different way of thinking about what medical AI should do’.
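What ‘documented transparently’ could look like in practice: each ensemble member carries a structured provenance record that travels with its output. The sketch below is our own illustration in Python; the class and its fields are hypothetical, not an interface defined in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Provenance record attached to one ensemble member (illustrative fields)."""
    name: str                     # internal identifier for the model
    training_regions: list[str]   # geographic origin of the training data
    training_period: str          # temporal coverage of the training data
    known_biases: list[str] = field(default_factory=list)  # documented limitations

# A clinician-facing view would pair each diagnosis with its model's card,
# so agreement and disagreement can be read against each model's background.
card = ModelCard(
    name="model-a",
    training_regions=["Northern Europe"],
    training_period="2015-2022",
    known_biases=["under-represents paediatric cases"],
)
```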
Why is keeping the clinician in the loop so central to MEDLEY?
‘Because the final decision in health care must remain with a human being. That is not just a regulatory requirement; it is an ethical imperative. But for human oversight to be meaningful, clinicians need to actually understand what they are overseeing. A single model that presents a confident answer with a polished explanation can undermine human judgment by encouraging automation bias, the tendency to over-trust automated systems. Research shows that when AI provides persuasive explanations, clinicians are more likely to accept incorrect outputs without questioning them’.
‘MEDLEY is designed to counteract this. By presenting multiple perspectives, including minority views and patterns of disagreement, the framework keeps clinicians actively engaged in the reasoning process. It functions more like a structured consultation panel than a black-box oracle. The clinician becomes an orchestrator of diverse analytical perspectives, not a passive recipient of a single answer’.
But won’t presenting outputs from many different models overwhelm the doctor?
‘This is a critical design question, and one we address directly in the paper. Cognitive Load Theory tells us that presenting too much information at once can impair rather than enhance decision-making. MEDLEY addresses this through what we call progressive disclosure. The default clinical view shows only the consensus finding with a summary uncertainty indicator. Alternative and minority diagnoses are available, but only when the clinician chooses to expand them, typically for complex or ambiguous cases’.
‘The system also uses threshold-based activation. For routine high-consensus cases, MEDLEY presents streamlined output, reserving the full ensemble plurality for cases where disagreement genuinely adds diagnostic value. Visual encodings like confidence bands and divergence indicators convey ensemble-level patterns without requiring the clinician to process each model’s output individually. The goal is accessibility when plurality matters, invisibility when it does not’.
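To make threshold-based activation concrete, the core logic can be pictured as a simple gate on the ensemble’s agreement level. A minimal sketch, assuming a plain list of per-model diagnoses; the 0.8 threshold and the function name are illustrative choices, not values from the paper.

```python
from collections import Counter

def render_view(diagnoses: list[str], consensus_threshold: float = 0.8) -> dict:
    """Streamlined output for high-consensus cases; the full ensemble
    plurality only where disagreement may add diagnostic value."""
    top_dx, top_count = Counter(diagnoses).most_common(1)[0]
    consensus = top_count / len(diagnoses)
    if consensus >= consensus_threshold:
        # Routine case: default view with a summary uncertainty indicator.
        return {"view": "streamlined", "finding": top_dx, "consensus": consensus}
    # Ambiguous case: alternative and minority diagnoses available on expansion.
    return {"view": "expanded", "finding": top_dx, "consensus": consensus,
            "alternatives": sorted(set(diagnoses) - {top_dx})}
```

In a real interface, the ‘expanded’ branch would feed the progressive-disclosure view described above rather than a raw dictionary.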
There is a lot of enthusiasm for Explainable AI right now. You seem skeptical?
‘Not skeptical of the goal, but of the assumption that explanation equals trustworthiness. This is one of the paper’s most important arguments. Large language models are remarkably good at generating explanations that sound clinically plausible: confident, articulate, well-structured. But research increasingly shows that these explanations often bear little relation to how the model actually arrived at its output. We call this “digital sophistry”: persuasive speech disguised as transparency’.
‘The danger is real. A model can give the wrong diagnosis with an impeccable-sounding rationale. If the clinician trusts the explanation, they trust the answer, and the patient may be harmed. Explainability, as currently practiced, can actually make automation bias worse rather than better’.
‘MEDLEY sidesteps this paradox entirely. Instead of relying on any single model’s ability to explain itself, it seeks reliability through the structured interplay of convergent and divergent perspectives across multiple models. It is a dialectic of outputs rather than a performance of reasoning. If several independent models with different training backgrounds agree, that convergence carries genuine evidential weight. If they disagree, that disagreement is a clinically meaningful signal’.
Bias is conventionally seen as a defect in AI. You take a different view?
‘Yes, and this is at the heart of the paradigm shift. Bias in AI reflects the data a model was trained on: which populations, which institutions, which clinical practices. Rather than treating this as purely negative, MEDLEY documents bias as a form of specialization. A model trained predominantly on data from East Asian populations may recognize certain conditions better than one trained on data from Northern Europe, and vice versa. The key is making these differences transparent so clinicians can weigh them in context’.
‘This does not mean all bias is acceptable. The paper draws clear ethical boundaries: bias that reinforces stereotypes, encodes discriminatory proxies, or substitutes statistical correlation for clinical causation is never acceptable. But a biased model contributing to a transparency-preserving ensemble is ethically distinct from the same model deployed as a standalone decision-maker’.
You built a proof-of-concept. What did it show?
‘To illustrate how MEDLEY works in practice, we developed a demonstrator for differential diagnosis, the clinical task of generating and ranking possible diagnoses, using over 30 large language models with diverse geographic, architectural, and temporal origins. It is important to emphasize that this was one specific application of the paradigm, not MEDLEY itself. The framework is designed to apply broadly across medical AI domains’.
‘For example, in medical imaging, the paper discusses how MEDLEY could make visible where different segmentation models disagree on tumor boundaries. In radiation therapy planning, those disagreements are not noise; they can be vital for avoiding radiation exposure to sensitive structures. Traditional ensemble methods aggregate these outputs into a single boundary and hide the discrepancies. MEDLEY preserves them for the oncologist to evaluate’.
‘In the differential diagnosis demonstrator, we found that consensus rates varied widely across synthetic cases, from around 48% for complex conditions to over 90% for well-established diagnoses. Cases with lower consensus were enriched for rare or region-specific conditions, precisely where multiple perspectives add the most value. In one case, a single model trained on data from the Eastern Mediterranean region flagged a genetic condition that all other models missed. That minority output was presented with transparent provenance, allowing the clinician to decide whether to investigate further’.
‘These are observations from a proof-of-concept using synthetic cases, not validated clinical findings. But they demonstrate that the approach is technically feasible and that meaningful patterns of agreement and disagreement emerge from diverse ensembles’.
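For the technically curious, a consensus rate of the kind reported above can be computed directly from each model’s top-ranked diagnosis, while minority views are kept alongside their provenance instead of being discarded. A toy sketch under our own assumptions about data shapes; the demonstrator’s actual pipeline is described in the paper.

```python
from collections import Counter

def consensus_report(outputs: dict[str, str], provenance: dict[str, str]) -> dict:
    """`outputs` maps model name -> top-ranked diagnosis;
    `provenance` maps model name -> training-data origin."""
    top_dx, top_count = Counter(outputs.values()).most_common(1)[0]
    minority = [
        {"model": m, "diagnosis": dx, "trained_on": provenance[m]}
        for m, dx in outputs.items() if dx != top_dx
    ]
    return {
        "consensus_diagnosis": top_dx,
        "consensus_rate": top_count / len(outputs),  # e.g. ~0.48 vs >0.90 above
        "minority_views": minority,  # preserved, with transparent provenance
    }
```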
How does MEDLEY differ from medical AI systems developed by companies like Google and Microsoft?
‘This is an important distinction. Systems like Google’s AMIE and Microsoft’s MAI-DxO are impressive achievements, but they operate within a fundamentally different paradigm. AMIE, for instance, uses a single large language model trained through self-play, where the model simulates both sides of a doctor-patient conversation. MAI-DxO assigns multiple roles, such as different specialist perspectives, but these roles are still performed by the same underlying model’.
‘MEDLEY is different because it uses genuinely heterogeneous, parallel models: systems built by different teams, trained on different data, with different architectures. The diversity is real, not simulated. When AMIE role-plays as a cardiologist and then as a neurologist, it is still the same model with the same training biases. When MEDLEY orchestrates a model trained on Scandinavian hospital data alongside one trained on Southeast Asian clinical records, those are fundamentally different analytical perspectives. That distinction matters for patient safety’.
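The architectural distinction is visible even in a few lines of code: MEDLEY-style orchestration fans the same case out to independently built models, rather than prompting one model to play several roles. A minimal sketch in which the model callables are hypothetical stand-ins for separately trained systems.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def query_ensemble(case: str,
                   models: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Send the same case to genuinely heterogeneous models in parallel.
    Each callable is assumed to wrap a distinct system: different team,
    different training data, different architecture."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(model, case) for name, model in models.items()}
        return {name: future.result() for name, future in futures.items()}
```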
You mention health equity. How does MEDLEY address disparities in care?
‘This is one of the most important implications of the paradigm. Current “universal” AI models are often trained predominantly on data from majority populations in high-income countries. Research shows that these models produce substantial errors for minority patients, and that performance disparities persist even after bias assessments. In effect, the pursuit of a single unbiased model can actively worsen care for underrepresented groups, because averaging across populations erases clinically meaningful differences’.
‘MEDLEY takes the opposite approach. By orchestrating models trained on diverse populations and preserving rather than averaging away their perspectives, the framework explicitly protects minority viewpoints. A model developed with data from sub-Saharan Africa, or from an indigenous health service, or from a rare disease registry contributes perspectives that would be diluted or lost in a single universal model. Making each model’s training population transparent also counters the dangerous assumption that AI systems are inherently objective’.
Does MEDLEY have implications beyond diagnosis, for example for AI security?
‘Yes, and this is a connection we find particularly compelling. One of the growing concerns in medical AI is data poisoning, where manipulated training data can steer an AI system toward unsafe behavior. If you rely on a single model and that model is compromised, there is no safety net’.
‘MEDLEY’s architecture provides an inherent defense. Because the models in the ensemble are trained on independent data sources with different architectures, they are unlikely to share the same vulnerabilities. If one model is poisoned, it will suddenly disagree with the rest of the ensemble in ways that can be detected. The same disagreement-monitoring that enriches diagnostic reasoning also functions as an alarm system against adversarial attacks. In our related work on AI security in health care, published in the Journal of Medical Internet Research, we explore this convergence in detail’.
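One way to picture that alarm function: monitor each model’s agreement with the ensemble majority over time and flag sudden collapses. The sketch below is our own toy illustration; the threshold and the data shapes are assumptions, not part of the published framework.

```python
from collections import Counter

def agreement_rates(cases: list[dict[str, str]]) -> dict[str, float]:
    """For each model, the fraction of cases where its diagnosis matched
    the ensemble majority. `cases` is a list of {model: diagnosis} dicts,
    all assumed to cover the same set of models."""
    hits = Counter()
    for outputs in cases:
        majority = Counter(outputs.values()).most_common(1)[0][0]
        for model, dx in outputs.items():
            hits[model] += int(dx == majority)
    return {model: hits[model] / len(cases) for model in cases[0]}

def flag_suspects(baseline: dict[str, float], recent: dict[str, float],
                  max_drop: float = 0.3) -> list[str]:
    """A model whose agreement with the ensemble suddenly collapses is a
    candidate for a compromised (for example, poisoned) member."""
    return [m for m in baseline if baseline[m] - recent.get(m, 0.0) > max_drop]
```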
Could MEDLEY change the health care AI market?
‘We believe so. The current market is drifting toward a few large corporations offering single “super models.” MEDLEY creates a fundamentally different dynamic. If diversity is valued, not just raw accuracy from a single source, then smaller companies with specialized models become essential contributors. A startup focused on rare diseases, a regional team with local population data, or an academic group targeting underrepresented populations all gain relevance within the ensemble. Contributions that are marginal in a single-model paradigm become essential in MEDLEY’.
‘Conceptually, this resembles an app store ecosystem for medical AI, where diverse contributions coexist and complement each other, democratizing participation and potentially accelerating innovation’.
How does this relate to the EU AI Act?
‘The EU AI Act classifies medical AI as high-risk, requiring bias mitigation, transparency, and human oversight. MEDLEY aligns naturally with these requirements because it makes bias visible by design and keeps the clinician as the final decision-maker. However, the current regulatory landscape does not yet have established pathways for certifying ensemble-level systems, where the value comes from the combination of models rather than any individual one. Developing these pathways is an important next step for both regulators and the research community’.
What needs to happen next?
‘The proof-of-concept establishes technical feasibility. Translating the paradigm to clinical practice requires several things in parallel: prospective clinical studies comparing MEDLEY-supported decisions against single-model and unassisted-clinician baselines; human factors research on how clinicians interact with multi-model outputs and where cognitive overload becomes a risk; collaboration with regulators on ensemble-level certification; and extension to multimodal data, combining imaging, laboratory values, and clinical notes within the same diversity-preserving architecture. The vision is human-AI collaboration in which multiple perspectives enhance rather than burden clinical reasoning, echoing what Kasparov called “Advanced Chess”: human-machine teams that outperform either component alone’, concludes Farhad Abtahi.
Publication
Leveraging imperfection with MEDLEY: a multi-model approach harnessing bias in medical AI. Abtahi F, Astaraki M and Seoane F (2026). Front. Artif. Intell. 9:1701665. doi: 10.3389/frai.2026.1701665
