Fine-Tuning a Clinical AI Model to Frontier Parity

Scope: this covers out-of-session Evidence, the search and reasoning a clinician does outside a live visit. It does not yet cover the in-session work where Evidence reaches into a patient's context and takes action.
Timing: the model is rolling out to production now; it is not yet serving every query.

Why bigger isn't always better in clinical AI

You don't need frontier scale to reach frontier quality. You need a reward signal that's yours alone, and a tight loop to learn from it. Six weeks ago, we started replacing the best frontier model running in Heidi Evidence with a model of our own, a fraction of its size. On blind side-by-side evaluation, it has already reached parity, to the point where clinicians can no longer tell which is which.

This post is about how we got there, what the result does and doesn't cover, and why we think the pattern generalizes beyond our own use at Heidi.

The signal only clinicians can give

Evidence is Heidi's clinical search product, free to use outside of a patient session. A clinician asks a question and gets an answer grounded in real sources. Evidence has answered more than 3.5 million questions since launch. It's not the volume of questions that's valuable; it's that Evidence answers are backed by something the general-purpose labs can't buy: a real clinician telling us which of two responses was the better one. That preference is the signal we train on.
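
To make "which of two responses was the better one" concrete, here is a minimal sketch of how a pairwise preference can become a training loss. It uses the Bradley-Terry formulation, a common choice for preference data. This post does not say which objective we use, so treat it as an illustration only; the PreferencePair record, its field names, and the example scores are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One clinician judgment between two answers (hypothetical schema)."""
    question: str
    chosen: str    # the answer the clinician preferred
    rejected: str  # the answer the clinician did not prefer


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(s_chosen - s_rejected)).

    The scores are assumed to come from some scoring model; how they are
    produced is outside the scope of this sketch.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability the scorer assigns to the clinician's actual choice."""
    return math.exp(-pairwise_loss(score_chosen, score_rejected))


if __name__ == "__main__":
    pair = PreferencePair(
        question="(illustrative question)",
        chosen="(answer the clinician picked)",
        rejected="(answer the clinician passed over)",
    )
    # Hypothetical scores: the scorer currently ranks the rejected answer higher,
    # so the loss is large and training would push the two scores apart.
    print(pairwise_loss(score_chosen=0.2, score_rejected=1.1))
    print(preference_probability(score_chosen=0.2, score_rejected=1.1))
```

The loss is small when the scorer already agrees with the clinician and grows roughly linearly as it disagrees more strongly, which is what lets a single "this one was better" click act as a gradient signal.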