Every version of every Heidi product is put under pressure before a clinician sees a result. We measure what comes back across safety and performance, and use the evidence to decide whether the version is fit to release.

Rigour is about leaning into the hard topics, not avoiding them: distinguishing inference from hallucination, the failure modes of single metrics, and the principles that guide us.

Scribe and Evidence do different work for different reasons, and so they must be evaluated separately. Within each, the models that handle different tasks, such as transcription and note generation, also must be evaluated separately. There is no single number for whether Heidi is working. There is a discipline for each thing it does.

Is this note safe? Is this note good for this clinician? These are different questions, and they need different signals. Safety has to be objective and exhaustive, whereas personalisation is selective by nature. The framework keeps these signals separate, and clears the safety gate before asking whether the note is good.

Heidi Evidence pulls from a curated knowledge base of authoritative sources and returns a cited, natural-language response. That design is what makes Evidence useful, and so evaluation focuses on the safety, quality and performance of that response.

It is tempting to reduce safety to one confident-looking number. A number like “2% hallucination rate” looks reassuring, but it says far less than it appears to.
A 2% that invents medications is not the same as a 2% that reorders a sentence. Without harm-weighting, the number flattens both into one.
2% measured by a blunt method is not 2% measured by a sharp one. The figure says as much about the test as the product.
Faithfulness says nothing about completeness, consistency, latency, or bias. A safe answer can still be a poor one.
Sharper measurement is uncomfortable on the way up as it surfaces issues quieter methods miss. The unit of trust is not the headline number. It is the methodology behind it.
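To make the harm-weighting point concrete, here is a minimal sketch. It assumes a hypothetical four-level severity scale with illustrative weights; neither the scale nor the weights come from the framework itself. It shows two toy releases that share the same 2% headline error rate but diverge once severity is taken into account.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical severity weights, chosen only for illustration.
SEVERITY_WEIGHTS = {"cosmetic": 0.0, "minor": 1.0, "major": 5.0, "critical": 25.0}


@dataclass(frozen=True)
class NoteResult:
    note_id: str
    error_severities: Tuple[str, ...]  # empty tuple means no errors were found


def raw_error_rate(results):
    """Share of notes with at least one error, regardless of severity."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.error_severities) / len(results)


def harm_weighted_score(results):
    """Mean severity weight per note; larger means more potential harm."""
    if not results:
        return 0.0
    total = sum(SEVERITY_WEIGHTS[s] for r in results for s in r.error_severities)
    return total / len(results)


def make_release(n_notes, n_bad, severity):
    """Build a toy release in which n_bad of n_notes contain one error each."""
    return [
        NoteResult(f"note-{i}", (severity,) if i < n_bad else ())
        for i in range(n_notes)
    ]


# Two toy releases: one reorders sentences, the other invents medications.
reordered = make_release(100, 2, "cosmetic")
invented = make_release(100, 2, "critical")

print(raw_error_rate(reordered), raw_error_rate(invented))
print(harm_weighted_score(reordered), harm_weighted_score(invented))
```

Both releases report the same raw rate, while the weighted score separates them. The weights are a modelling choice, which is one reason the methodology matters more than the headline figure.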
Every metric can be interrogated. We aim to publish what is measured, how it is measured, what it catches, and what it misses. The work is in the open.
Errors are read in context. Severity, likelihood and clinical setting weigh into how an error is judged. Clinical judgment is in the loop, not at arm's length from it.
Every new metric clears the same lifecycle before it's used to influence a release decision. Conceptualise, define, validate, stratify, deploy, externally validate.
Lexical, semantic, structured, targeted, holistic, human. No one metric catches every failure. Disagreement between the layers is a finding in itself.
Each product and underlying model is evaluated against the job it does. Transcription is not note generation. The framework follows the work.
Scribe. Listens, transcribes and produces clinical documentation.
Evaluation focus: what reaches the note. Safety against the transcript, performance on what makes the note good.
Evidence. Cited, natural-language answers from authoritative clinical sources.
Evaluation focus: what reaches the clinician. Faithfulness, factfulness, harm risk and more.
Scribe evaluation runs seven layers in parallel. Each sees something different.
No layer is the verdict on its own. The framework reads them together, against each other and against the current production version, and treats the differences as evidence in their own right. When the layers disagree, the disagreement points at a blind spot worth investigating in a metric, in a model, or in our taxonomy.
| Layer | What it captures |
|---|---|
| 1. Programmatic detectors | Catastrophic generation failures. Empty output, format corruption, wrong-language drift. |
| 2. Lexical similarity | Surface-level change. Formatting regressions and mass rewordings against a reference. |
| 3. Semantic similarity | Meaning preservation across paraphrase and reordering. Blind to plausible inventions. |
| 4. Structured clinical entity extraction | Concept-level fidelity. Whether medications, allergies, diagnoses, findings present in the transcript appear in the note. |
| 5. Targeted judges | Pre-specified failure modes. Fabricated medications, missing safety netting, invented reasoning steps, internal contradictions. |
| 6. Domain judges | The long tail. Free-form issue extraction across accuracy, detail and personalisation domains. |
| 7. Human review | Ground truth for every layer above. Calibration of judge precision. |
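
One way to treat disagreement between layers as a finding is sketched below. It assumes each layer reduces to a per-note boolean flag, which is a simplification. The layer names, note IDs and the `find_layer_disagreements` helper are hypothetical and used only for illustration.

```python
def find_layer_disagreements(layer_flags):
    """Return notes where some layers flagged a problem and others did not.

    layer_flags maps note_id -> {layer_name: True if that layer flagged the note}.
    """
    disagreements = {}
    for note_id, flags in layer_flags.items():
        flagged = sorted(name for name, hit in flags.items() if hit)
        passed = sorted(name for name, hit in flags.items() if not hit)
        if flagged and passed:
            disagreements[note_id] = {"flagged_by": flagged, "passed_by": passed}
    return disagreements


# Toy data: note-001 reads as semantically close to its reference, yet a
# targeted judge flags it, the pattern expected from a plausible invention.
example_flags = {
    "note-001": {"semantic_similarity": False, "targeted_judge": True},
    "note-002": {"semantic_similarity": False, "targeted_judge": False},
    "note-003": {"semantic_similarity": True, "targeted_judge": True},
}

for note_id, detail in find_layer_disagreements(example_flags).items():
    print(note_id, detail)
```

Notes where every layer agrees are left out. Only the split cases are queued for investigation, since they may point at a blind spot in a metric, a model, or the taxonomy.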
Heidi supports clinical judgment. Evaluation is the discipline that lets us mean it when we say so.
Seven Scribe evaluation layers. Lexical, semantic, structured, targeted, domain, programmatic and human, all running in parallel on every release.
15 Evidence dimensions. Tested across safety, quality and performance tiers, on every version of every model.
Methods retune to clinical judgment. Test sets grow. The framework is itself under review.
Evaluation is not a single accuracy score. By assessing 15 dimensions across an extensive and deep set of use cases, we can put the system under pressure and measure what comes back across safety, quality and performance. This is how we decide whether each version is fit to release.
| Tier | Dimension | What it measures |
|---|---|---|
| Safety | Faithfulness | Whether claims are supported by content at the cited URLs. |
| Safety | Factfulness | Whether the medical content is objectively correct. |
| Safety | Harm Risk · Extent | Severity of potential physical harm if the response were followed. |
| Safety | Harm Risk · Likelihood | Probability a clinician would act on the response as written. |
| Safety | Security | Robustness to adversarial manipulation across attack categories. |
| Quality | Completeness | Whether the response addresses every part of the query. |
| Quality | Readability | Whether the response is structured for efficient clinician use. |
| Quality | Fully Cited | Whether every substantive claim carries an inline citation. |
| Quality | Source Authoritativeness | Whether cited sources are guideline-quality (NICE, AHA, Cochrane and similar). |
| Quality | No Unexplained Contradictions | Whether the response is internally consistent. |
| Quality | Structured Formatting | Whether the response uses visual hierarchy a clinician can scan. |
| Quality | Core Answer at Top | Whether the direct answer is presented before supporting reasoning. |
| Performance | Latency | Time to first token and time to completion. |
| Performance | Information Density | Unique clinical facts per token. |
| Performance | Response Consistency | Semantic similarity across rephrased versions of the same query. |
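
Two of the performance dimensions above lend themselves to a short sketch. It rests on loud assumptions: tokens are approximated by whitespace splitting, the clinical facts are supplied by some upstream extractor that is not shown, and Jaccard word overlap stands in for a real semantic similarity model. None of these choices is taken from the framework.

```python
from itertools import combinations


def information_density(clinical_facts, response_text):
    """Unique clinical facts per token (whitespace tokens, as an assumption)."""
    unique_facts = {fact.strip().lower() for fact in clinical_facts if fact.strip()}
    tokens = response_text.split()
    return len(unique_facts) / len(tokens) if tokens else 0.0


def jaccard_similarity(text_a, text_b):
    """Word-overlap similarity; a crude stand-in for semantic similarity."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def response_consistency(responses, similarity=jaccard_similarity):
    """Mean pairwise similarity across answers to rephrased versions of a query."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)


# Toy inputs, invented for illustration only.
facts = ["Amoxicillin 500 mg", "amoxicillin 500 mg ", "Review in 48 hours"]
answer = "Start amoxicillin 500 mg and review in 48 hours"
print(information_density(facts, answer))
print(response_consistency(["a b c", "a b c", "a b d"]))
```

Passing a different `similarity` callable swaps in a stronger measure without changing the averaging logic.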
The short version of the questions clinicians and buyers ask us most.
Heidi is a multi-product AI care partner built around Scribe, Evidence, Comms, and Remote. It covers documentation, evidence-backed clinical answers, patient communications, and dedicated hardware for any care setting. It works across 200+ specialties in 110+ languages and connects with major EHR systems to support clinicians at every step of the clinical day.
It is the answer to two questions, asked every release. Is it safe? Is it good?
Both questions need to be answered before a release reaches a clinician. They are not interchangeable, and they cannot be flattened into one number. Safety is a hard floor. Performance is what we optimise above it.
Clinical safety is a constraint, not a target. A release must clear the safety gate before we ask whether it is good. Missed harm is still harm whether or not anyone reports it, so the signal has to be objective and exhaustive.
A good note is accurate, well-structured, and fast. The bar moves with the clinician, the specialty, and the encounter, so performance is an objective to optimise, not a single score to hit.
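The ordering described above, safety as a gate and performance as something optimised only above it, can be sketched as follows. The `CandidateEvaluation` fields, status strings and example values are hypothetical. Performance is deliberately kept as separate dimensions and is not collapsed into one score.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CandidateEvaluation:
    version: str
    open_safety_findings: List[str] = field(default_factory=list)
    performance: Dict[str, float] = field(default_factory=dict)


def review_candidate(candidate):
    """Check the safety gate first; only then surface performance."""
    if candidate.open_safety_findings:
        # Performance is not consulted at all for a blocked candidate.
        return {
            "version": candidate.version,
            "status": "blocked_at_safety_gate",
            "reasons": list(candidate.open_safety_findings),
        }
    return {
        "version": candidate.version,
        "status": "cleared_safety_gate",
        "performance": dict(candidate.performance),
    }


# Toy candidates: the faster one carries an unresolved safety finding.
fast_but_unsafe = CandidateEvaluation(
    "v2-fast", ["fabricated medication"], {"latency_seconds": 1.0}
)
slower_but_safe = CandidateEvaluation("v2-safe", [], {"latency_seconds": 2.5})

print(review_candidate(fast_but_unsafe)["status"])
print(review_candidate(slower_but_safe)["status"])
```

No performance gain can buy back a failed gate in this sketch, which is what distinguishes a constraint from a target.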
Without inference, a scribe is a transcript. With inference, the same operation that turns "how long have you had this" into "2 weeks history of cough" can also fabricate a finding the clinician never elicited. Hallucination is the same operation, sitting further down the same continuum, and a robust framework must consider this spectrum.
Patient said "chest tightness when I walk upstairs." Note reads "exertional chest tightness."
Same patient. Note reads "exertional chest tightness, suggestive of cardiac aetiology." Useful if the reader applies judgment. Risky if it becomes a diagnostic anchor.
Same patient. Note reads "exertional angina." The inference has crossed into a diagnostic claim the conversation did not support.
Nothing said about exertional symptoms. Note reads "denies exertional symptoms." A confident negative that was never elicited.
These are the kinds of issues a single metric will smooth over, but a layered framework is built to surface.
A faster, denser release that quietly drops completeness, citation coverage, or safety margin.
A clinically plausible finding the conversation never supported, hidden inside a well-structured note.
Two clinically identical questions producing materially different answers.
Race-based or gender-based reasoning that is debunked, but easy to repeat with confidence.
Methods differ; the rhythm does not. The same cadence governs Scribe and Evidence.
Every candidate version runs against the full test set before it can ship. Flagged outputs go to a clinician on our team.
Every candidate is run side by side against the current production version. Trade-offs are visible to the release decision, not hidden behind a single number.
The pipeline runs continuously after release. Every version produces a comparable evaluation dataset, feeding into our post-market surveillance (PMS) plan under EU MDR Article 83.
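A side-by-side comparison against production can be sketched as a per-metric table of deltas. The metric names, values and the `compare_to_production` helper below are hypothetical. The point is only that each trade-off stays visible and is not averaged away.

```python
def compare_to_production(candidate, production, higher_is_better):
    """Return a per-metric delta and verdict for metrics present in all inputs."""
    shared = candidate.keys() & production.keys() & higher_is_better.keys()
    report = {}
    for metric in sorted(shared):
        delta = candidate[metric] - production[metric]
        if delta == 0:
            verdict = "unchanged"
        elif (delta > 0) == higher_is_better[metric]:
            verdict = "improved"
        else:
            verdict = "regressed"
        report[metric] = {"delta": delta, "verdict": verdict}
    return report


# Toy numbers: a faster candidate that quietly drops completeness.
candidate_metrics = {"latency_seconds": 2.0, "completeness": 0.90, "faithfulness": 0.99}
production_metrics = {"latency_seconds": 3.0, "completeness": 0.95, "faithfulness": 0.99}
direction = {"latency_seconds": False, "completeness": True, "faithfulness": True}

for metric, row in compare_to_production(
    candidate_metrics, production_metrics, direction
).items():
    print(metric, row["verdict"])
```

The report lists the latency gain and the completeness regression next to each other, so the release decision sees both.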
Speech recognition turns audio into a transcript. A language model turns the transcript into a note. They have different failure modes, are measured against different references, and are evaluated separately before the end-to-end pipeline is evaluated as a whole.
Transcription. Common evaluation methods include Word Error Rate (WER) against a reference transcript. This is limited, as the errors that matter for downstream use, such as a misheard medication, are not always the errors that move WER.
Note generation. Common evaluation methods include comparing against clinician edits or a reference note. This is limited, as a low number of edits can mean Heidi got it right, or that the clinician did not catch what Heidi got wrong.
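The limitation of WER noted under transcription can be illustrated with a small, self-contained implementation using word-level edit distance. The sentences are invented for illustration. The sketch shows a dropped article and a misheard medication producing the same WER.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient takes metoprolol fifty milligrams twice a day"
dropped_article = "patient takes metoprolol fifty milligrams twice a day"
wrong_medication = "the patient takes metformin fifty milligrams twice a day"

print(word_error_rate(reference, dropped_article))
print(word_error_rate(reference, wrong_medication))
```

Each hypothesis contains exactly one word-level error against a nine-word reference, so WER cannot tell the harmless omission from the medication swap. That gap is why a metric like this needs other signals alongside it.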
Our multi-layered approach means we can better measure and improve Heidi, built to clinical standards with trust at the core.
Whether you're a clinician, enterprise buyer, or regulator, our doors are open.