How accurate is Heidi Scribe?

Heidi achieves high documentation accuracy, built around the way clinicians like you think. Every note is a draft for your review. You stay in control of what gets finalized.

Is AI clinical notes software reliable enough for daily use?

Yes. It takes on the documentation load so clinicians can focus on patients and reclaim several hours of their day, which helps reduce the risk of burnout.

We treat clinical AI evaluation as a regulated science.

Every version of every Heidi product is put under pressure before a clinician sees a result. We measure what comes back across safety and performance, and use the evidence to decide whether the version is fit to release.

Talk to our team

This is how we deliver on the commitment behind every Heidi product.

AI built to clinical standards
AI that clinicians and patients trust
AI that gets measurably better with every release

1 · Why evaluation matters

Trust is not a feeling.

It is the answer to two questions, asked every release. Is it safe? Is it good?

Both questions need to be answered before a release reaches a clinician. They are not interchangeable, and they cannot be flattened into one number. Safety is a hard floor. Performance is what we optimise above it.

A hard floor.

Clinical safety is a constraint, not a target. A release must clear the safety gate before we ask whether it is good. Missed harm is still harm whether or not anyone reports it, so the signal has to be objective and exhaustive.

What good feels like.

A good note is accurate, well-structured, and fast. The bar moves with the clinician, the specialty, and the encounter, so performance is an objective to optimise, not a single score to hit.

2 · Rigour in practice

Rigour is what makes the framework worth trusting.

Rigour is about leaning into the hard topics, not avoiding them: distinguishing inference from hallucination, the failure modes of single metrics, and the principles that guide us.

2.1 · The inference spectrum

Rigour on inference:
Inference is the product and the failure mode.

Without inference, a scribe is a transcript. With inference, the same operation that turns an exchange beginning "how long have you had this?" into "2-week history of cough" can also fabricate a finding the clinician never elicited. Hallucination is the same operation, sitting further down the same continuum, and a robust framework must account for the whole spectrum.

Benign restatement

Patient said "chest tightness when I walk upstairs." Note reads "exertional chest tightness."

Useful clinical inference

Same patient. Note reads "exertional chest tightness, suggestive of cardiac aetiology." Useful if the reader applies judgment. Risky if it becomes a diagnostic anchor.

Diagnostic over-reach

Same patient. Note reads "exertional angina." The inference has crossed into a diagnostic claim the conversation did not support.

Fabricated finding

Nothing said about exertional symptoms. Note reads "denies exertional symptoms." A confident negative that was never elicited.

2.2 · What a single-metric evaluation misses

Rigour on depth:
Single-metric tests hide failures

It is tempting to reduce safety to one confident-looking number. A number like “2% hallucination rate” looks reassuring, but it says far less than it appears to.

Severity is invisible.

A 2% that invents medications is not the same as a 2% that reorders a sentence. Without harm-weighting, the number flattens both into one.

Sensitivity is unknown.

2% measured by a blunt method is not 2% measured by a sharp one. The figure says as much about the test as it does about the product.

It is one axis of many.

Faithfulness says nothing about completeness, consistency, latency, or bias. A safe answer can still be a poor one.
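The severity point can be made concrete with a minimal Python sketch. The error categories and harm weights below are illustrative assumptions, not Heidi's taxonomy or scoring. Two made-up releases share the same 2% raw error rate, yet their harm-weighted scores differ by two orders of magnitude.

```python
from collections import Counter

# Hypothetical harm weights per error type (assumed for illustration only).
HARM_WEIGHTS = {"reordered_sentence": 0.1, "invented_medication": 10.0}


def raw_error_rate(errors, total_notes):
    """Flagged errors per note reviewed, ignoring severity.

    Assumes at most one flagged error per note, so this is also the
    share of notes affected.
    """
    return len(errors) / total_notes


def harm_weighted_score(errors, total_notes):
    """Sum of harm weights per note reviewed, so severe errors dominate."""
    counts = Counter(errors)
    return sum(HARM_WEIGHTS[kind] * n for kind, n in counts.items()) / total_notes


# Two made-up releases, each with 20 flagged notes out of 1,000.
release_a = ["reordered_sentence"] * 20
release_b = ["invented_medication"] * 20

rate_a = raw_error_rate(release_a, 1000)
rate_b = raw_error_rate(release_b, 1000)
score_a = harm_weighted_score(release_a, 1000)
score_b = harm_weighted_score(release_b, 1000)

print(rate_a, rate_b)    # identical raw rates
print(score_a, score_b)  # very different harm-weighted scores
```

The headline rate cannot tell the two releases apart; the weighted score can, but only as well as the weights reflect clinical judgment.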

These are the kinds of issues a single metric will smooth over, but a layered framework is built to surface.

Hidden trade-offs.

A faster, denser release that quietly drops completeness, citation coverage, or safety margin.

Inferred fabrications.

A clinically plausible finding the conversation never supported, hidden inside a well-structured note.

Response inconsistency.

Two clinically identical questions producing materially different answers.

Embedded bias.

Race-based or gender-based reasoning that has been debunked but is easy to repeat with confidence.

3 · One framework will not do

Heidi is purpose-built, not general-purpose.

Scribe and Evidence do different work for different reasons, and so they must be evaluated separately. Within each, the models that handle different tasks, such as transcription and note generation, also must be evaluated separately. There is no single number for whether Heidi is working. There is a discipline for each thing it does.

3.1 · When testing happens

Every release. Every version.
Continuously.

Methods differ; the rhythm does not. The same cadence governs Scribe and Evidence.

Full test set before ship.

Every candidate version runs against the full test set before it can ship. Flagged outputs go to a clinician on our team.

Side by side with production.

Every candidate is run side by side against the current production version. Trade-offs are visible to the release decision, not hidden behind a single number.

Continuous, after release.

The pipeline runs continuously after release. Every version produces a comparable evaluation dataset, feeding into our post-market surveillance (PMS) plan under EU MDR Article 83.
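One way to picture the release decision is as a two-step check: the safety gate is a hard floor, and only candidates that clear it are compared side by side with production. The Python sketch below is illustrative only. The `EvalResult` type, the metric names, and the zero-failure gate are assumptions, not Heidi's actual release criteria.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical summary of one version's run on the full test set."""
    safety_failures: int   # outputs confirmed unsafe after review
    metrics: dict          # performance metrics, higher is better


def compare_to_production(candidate: EvalResult, production: EvalResult) -> dict:
    """Per-metric deltas, so trade-offs stay visible instead of being averaged away."""
    return {
        name: round(candidate.metrics[name] - production.metrics[name], 4)
        for name in production.metrics
    }


def release_report(candidate: EvalResult, production: EvalResult) -> dict:
    # Step 1: the safety gate is a constraint, checked before anything else.
    if candidate.safety_failures > 0:
        return {"cleared_safety_gate": False, "deltas": None}
    # Step 2: only then ask whether the candidate is better, metric by metric.
    return {
        "cleared_safety_gate": True,
        "deltas": compare_to_production(candidate, production),
    }


production = EvalResult(0, {"completeness": 0.90, "speed": 0.70})
faster_but_thinner = EvalResult(0, {"completeness": 0.85, "speed": 0.80})
unsafe = EvalResult(1, {"completeness": 0.99, "speed": 0.99})

report = release_report(faster_but_thinner, production)
blocked = release_report(unsafe, production)
print(report)
print(blocked)
```

Returning the deltas separately, and not as one blended score, keeps a speed gain from hiding a completeness loss. The unsafe candidate is blocked even though its performance numbers are the best of the three.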

4 · Heidi Scribe evaluation

Heidi Scribe evaluation is a multi-system, multi-question problem

Is this note safe? Is this note good for this clinician? These are different questions, and they need different signals. Safety has to be objective and exhaustive, whereas personalisation is selective by nature. The framework keeps these signals separate, and clears the safety gate before asking whether the note is good.

4.1 · Two stages, one pipeline

A clinical AI scribe is two systems stacked.

Speech recognition turns audio into a transcript. A language model turns the transcript into a note. They have different failure modes, are measured against different references, and are evaluated separately before the end-to-end pipeline is evaluated as a whole.

Input: The Visit
Intermediate: Transcript
Output: Clinical Output

Stage 1

Transcription. Common evaluation methods include Word Error Rate (WER) against a reference transcript. This is limited, as the errors that matter for downstream use, such as a misheard medication, are not always the errors that move WER.

Stage 2

Note generation. Common evaluation methods include comparing against clinician edits or a reference note. This is limited, as a low number of edits can mean Heidi got it right, or that the clinician did not catch what Heidi got wrong.
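The Stage 1 limitation can be seen in a small worked example. The sketch below computes WER in the standard way, as word-level edit distance divided by the number of reference words. The function name and the three transcripts are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference transcript must contain at least one word.")
    # prev[j] holds the edit distance between the ref words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "patient takes metformin twice daily for type two diabetes and feels well"

# One substituted word, but it is the medication name.
misheard_drug = "patient takes metoprolol twice daily for type two diabetes and feels well"

# Three word-level edits, none of which change the clinical meaning.
harmless_slips = "the patient takes metformin twice a day for type two diabetes and feels well"

wer_misheard = word_error_rate(reference, misheard_drug)
wer_harmless = word_error_rate(reference, harmless_slips)
print(round(wer_misheard, 3), round(wer_harmless, 3))
```

On these made-up transcripts, the clinically harmless version scores a WER three times worse than the one with the wrong medication, which is why WER alone is a weak proxy for downstream safety.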

Our multi-layered approach lets us measure and improve Heidi more precisely, to clinical standards and with trust at the core.

4.2 · Layered measurement

Many layers, each measuring something different. The differences between them are the signal.

Scribe evaluation runs seven layers in parallel. Each sees something different.
No layer is the verdict on its own. The framework reads them together, against each other and against the current production version, and treats the differences as evidence in their own right. When the layers disagree, the disagreement points at a blind spot worth investigating in a metric, in a model, or in our taxonomy.
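A toy illustration of layers disagreeing, sketched in Python under loud assumptions: word-set overlap stands in for a lexical layer, a four-item medication list stands in for structured entity extraction, and the transcript, reference, and note are invented. None of this reflects Heidi's actual implementation.

```python
import re

# Tiny hypothetical medication vocabulary; a real extractor would be far richer.
KNOWN_MEDICATIONS = {"metformin", "ramipril", "atorvastatin", "warfarin"}


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_similarity(reference: str, note: str) -> float:
    """Jaccard overlap of word sets: a crude stand-in for a lexical layer."""
    a, b = tokens(reference), tokens(note)
    return len(a & b) / len(a | b) if a | b else 1.0


def unsupported_medications(transcript: str, note: str) -> set:
    """Medications named in the note that never appear in the transcript."""
    return (tokens(note) & KNOWN_MEDICATIONS) - tokens(transcript)


def layers_diverge(transcript, reference, note, lexical_threshold=0.7):
    """True when the lexical layer looks fine but the entity layer objects."""
    looks_fine = lexical_similarity(reference, note) >= lexical_threshold
    return looks_fine and bool(unsupported_medications(transcript, note))


transcript = "I take metformin every morning and my sugars have been stable"
reference = "Takes metformin every morning. Blood sugars stable."
note = "Takes metformin and warfarin every morning. Blood sugars stable."

similarity = lexical_similarity(reference, note)
flagged = unsupported_medications(transcript, note)
print(round(similarity, 2), flagged, layers_diverge(transcript, reference, note))
```

The note passes the lexical check because only two words differ from the reference, while the entity check flags a medication the transcript never mentions. The disagreement itself is the useful output.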

#	Layer	What it captures
1	Programmatic detectors	Catastrophic generation failures. Empty output, format corruption, wrong-language drift.
2	Lexical similarity	Surface-level change. Formatting regressions and mass rewordings against a reference.
3	Semantic similarity	Meaning preservation across paraphrase and reordering. Blind to plausible inventions.
4	Structured clinical entity extraction	Concept-level fidelity. Whether medications, allergies, diagnoses, findings present in the transcript appear in the note.
5	Targeted judges	Pre-specified failure modes. Fabricated medications, missing safety netting, invented reasoning steps, internal contradictions.
6	Domain judges	The long tail. Free-form issue extraction across accuracy, detail and personalisation domains.
7	Human review	Ground truth for every layer above. Calibration of judge precision.
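The calibration role of human review can be sketched as a precision calculation that treats clinician labels as ground truth. The Python below is a hypothetical illustration with made-up labels. Recall is included as an assumption on our part, since a judge that misses issues matters as much as one that over-flags.

```python
def judge_precision_recall(judge_flags, human_labels):
    """Compare an automated judge's flags with clinician labels for the same notes.

    Both arguments are lists of booleans, one per note; True means "issue present".
    Returns (precision, recall); either is None when it is undefined.
    """
    if len(judge_flags) != len(human_labels):
        raise ValueError("Each note needs both a judge flag and a human label.")
    pairs = list(zip(judge_flags, human_labels))
    true_pos = sum(j and h for j, h in pairs)
    false_pos = sum(j and not h for j, h in pairs)
    false_neg = sum(h and not j for j, h in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else None
    return precision, recall


# Made-up review of eight notes.
judge = [True, True, True, False, False, True, False, False]
human = [True, False, True, False, True, True, False, False]

precision, recall = judge_precision_recall(judge, human)
print(precision, recall)
```

Here the judge raises one false alarm and misses one real issue, so both figures come out at 0.75. Tracking them per judge and per release shows whether a change in flagged issues reflects the model or the measuring instrument.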

5 · Heidi Evidence evaluation

Heidi Evidence has multiple uses, requiring multiple dimensions to evaluate

Heidi Evidence pulls from a curated knowledge base of authoritative sources and returns a cited, natural-language response. That design is what makes Evidence useful, and so evaluation focuses on the safety, quality and performance of that response.

6 · Closing

Evaluation is not something we have done. It is something we keep doing.

Heidi supports clinical judgment. Evaluation is the discipline that lets us mean it when we say so.

7 Scribe evaluation layers. Lexical, semantic, structured, targeted, domain, programmatic and human, all running in parallel on every release.

15 Evidence dimensions. Tested across safety, quality and performance tiers, on every version of every model.

An iterative discipline.

Methods retune to clinical judgment. Test sets grow. The framework is itself under review.

The bar for how clinical AI is evaluated rises faster when the work is held in the open.

Whether you're a clinician, enterprise buyer, or regulator, our doors are open.

Talk to our team

FAQ

Everything else you might want to know.

The short version of the questions clinicians and buyers ask us most.

Heidi is a multi-product AI care partner built around Scribe, Evidence, and Remote. It covers documentation, evidence-backed clinical answers, patient communications, and dedicated hardware for any care setting. It works across 200+ specialties in 110+ languages and connects with major EHR systems to support clinicians at every step of the clinical day.

2.3 · The principles we don't compromise on

Rigour on principles: An industry standard for clinical AI evaluation.

Sharper measurement is uncomfortable on the way up, because it surfaces issues that quieter methods miss. The unit of trust is not the headline number. It is the methodology behind it.

1. Transparency of methods

Every metric can be interrogated. We aim to publish what is measured, how it is measured, what it catches, and what it misses. The work is in the open.

2. Understanding of harm, founded in clinical knowledge

Errors are read in context. Severity, likelihood and clinical setting weigh into how an error is judged. Clinical judgment is in the loop, not at arm's length from it.

3. A defined lifecycle for every new evaluation method

Every new metric clears the same lifecycle before it's used to influence a release decision. Conceptualise, define, validate, stratify, deploy, externally validate.

4. Layered measurement, divergence as signal

Lexical, semantic, structured, targeted, holistic, human. No one metric catches every failure. Disagreement between the layers is a finding in itself.

5. Product-specific evaluation

Each product and underlying model is evaluated against the job it does. Transcription is not note generation. The framework follows the work.

Heidi Scribe

Listens, transcribes and produces clinical documentation.

Evaluation focus

What reaches the note. Safety against the transcript, performance on what makes the note good.

Transcription, Note Generation, Verification

5.1 · Fifteen dimensions, every model

Fifteen dimensions, three tiers, on every version of every model.

Evaluation is not a single accuracy score. By assessing 15 dimensions across a broad and deep set of use cases, we can put the system under pressure and measure what comes back across safety, quality and performance. This is how we decide whether each version is fit to release.

Tier	Dimension	What it measures
Safety	Faithfulness	Whether claims are supported by content at the cited URLs.
Safety	Factfulness	Whether the medical content is objectively correct.
Safety	Harm Risk · Extent	Severity of potential physical harm if the response were followed.
Safety	Harm Risk · Likelihood	Probability a clinician would act on the response as written.
Safety	Security	Robustness to adversarial manipulation across attack categories.
Quality	Completeness	Whether the response addresses every part of the query.
Quality	Readability	Whether the response is structured for efficient clinician use.
Quality	Fully Cited	Whether every substantive claim carries an inline citation.
Quality	Source Authoritativeness	Whether cited sources are guideline-quality (NICE, AHA, Cochrane and similar).
Quality	No Unexplained Contradictions	Whether the response is internally consistent.
Quality	Structured Formatting	Whether the response uses visual hierarchy a clinician can scan.
Quality	Core Answer at Top	Whether the direct answer is presented before supporting reasoning.
Performance	Latency	Time to first token and time to completion.
Performance	Information Density	Unique clinical facts per token.
Performance	Response Consistency	Semantic similarity across rephrased versions of the same query.
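
Response Consistency is described in the table as semantic similarity across rephrased versions of the same query. A minimal Python sketch of that calculation follows. It uses bag-of-words cosine similarity as a crude stand-in for a real semantic similarity model, and the example responses are invented placeholders; it makes no claim about how Heidi computes the dimension.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def response_consistency(responses) -> float:
    """Mean pairwise similarity of responses to rephrasings of one query."""
    vectors = [bag_of_words(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("Need at least two responses to compare.")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Placeholder responses to two rephrasings of the same (hypothetical) query.
stable = ["First-line option is drug A.", "Drug A is the first-line option."]
unstable = ["First-line option is drug A.", "Drug B is preferred; avoid drug A."]

stable_score = response_consistency(stable)
unstable_score = response_consistency(unstable)
print(round(stable_score, 2), round(unstable_score, 2))
```

The unstable pair still shares several words, which is a reminder that word overlap understates the clinical difference between the two answers. A production metric would need a similarity model that captures meaning.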

Heidi Evidence

Cited, natural-language answers from authoritative clinical sources.

Evaluation focus

What reaches the clinician. Faithfulness, factfulness, harm risk and more.

Retrieval, Source Ranking, Synthesis, Citation Handling
