Heidi AI
© 2026 Heidi. All rights reserved.


We treat clinical AI evaluation as a regulated science.

Every version of every Heidi model is put under pressure before a clinician sees a result. We measure what comes back across safety and performance, and use the evidence to decide whether the version is fit to release.

Talk to our team
2 · Rigour in practice

Rigour is what makes the framework worth trusting.

Rigour is about leaning into the hard topics, not avoiding them. Distinguishing inference from hallucination. Failure modes of single metrics. Principles that guide us.

3 · One framework will not do

Heidi is purpose-built, not general purpose.

Scribe and Evidence do different work for different reasons, and so they must be evaluated separately. Within each, the models that handle different tasks, such as transcription and note generation, also must be evaluated separately. There is no single number for whether Heidi is working. There is a discipline for each thing it does.

4 · Heidi Scribe evaluation

Heidi Scribe evaluation is a multi-system, multi-question problem

Is this note safe? Is this note good for this clinician? These are different questions, and they need different signals. Safety has to be objective and exhaustive, whereas personalisation is selective by nature. The framework keeps these signals separate, and clears the safety gate before asking whether the note is good.

5 · Heidi Evidence evaluation

Heidi Evidence has multiple uses, requiring multiple dimensions to evaluate

Heidi Evidence pulls from a curated knowledge base of authoritative sources and returns a cited, natural-language response. That design is what makes Evidence useful, and so evaluation focuses on the safety, quality and performance of that response.

2.2 · What a single-metric evaluation misses

Rigour on depth: Single-metric tests hide failures

It is tempting to reduce safety to one confident-looking number. A number like “2% hallucination rate” looks reassuring, but it says far less than it appears to.

Severity is invisible.

A 2% that invents medications is not the same as a 2% that reorders a sentence. Without harm-weighting, the number flattens both into one.

Sensitivity is unknown.

2% measured by a blunt method is not 2% measured by a sharp one. The figure says as much about the test as the product.

It is one axis of many.

Faithfulness says nothing about completeness, consistency, latency, or bias. A safe answer can still be a poor one.
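
The severity point can be made concrete with a small sketch. The error categories and weights below are illustrative assumptions, not Heidi's harm taxonomy, and the sketch assumes at most one flagged error per note. It shows two hypothetical releases with the same raw 2% rate and very different harm-weighted rates.

```python
from collections import Counter

# Hypothetical severity weights; categories and values are illustrative
# assumptions, not an actual clinical harm taxonomy.
SEVERITY_WEIGHTS = {"reordered_sentence": 0.1, "invented_medication": 10.0}


def raw_rate(errors: list[str], total_notes: int) -> float:
    """Share of notes with a flagged error, ignoring severity.

    Assumes at most one flagged error per note.
    """
    return len(errors) / total_notes


def harm_weighted_rate(errors: list[str], total_notes: int) -> float:
    """Sum of severity weights per note, so severe errors dominate."""
    counts = Counter(errors)
    weighted = sum(SEVERITY_WEIGHTS[kind] * n for kind, n in counts.items())
    return weighted / total_notes


# Two hypothetical releases, each with 2 flagged notes out of 100.
release_a = ["reordered_sentence", "reordered_sentence"]
release_b = ["invented_medication", "invented_medication"]

rate_a, rate_b = raw_rate(release_a, 100), raw_rate(release_b, 100)
harm_a = harm_weighted_rate(release_a, 100)
harm_b = harm_weighted_rate(release_b, 100)
```

Both releases report the same raw rate, while the weighted figure for the second is two orders of magnitude higher. The weights themselves are a clinical judgment, which is why they are labelled as assumptions here.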

Heidi Scribe

Listens, transcribes and produces clinical documentation.

Evaluation focus

What reaches the note. Safety against the transcript, performance on what makes the note good.

Transcription · Note Generation · Verification

Heidi Evidence

Cited, natural-language answers from authoritative clinical sources.

Evaluation focus

What reaches the clinician. Faithfulness, factfulness, harm risk and more.

Retrieval · Source Ranking · Synthesis · Citation Handling
4.2 · Layered measurement

Many layers, each measuring something different. The differences between them are the signal.

Scribe evaluation runs seven layers in parallel. Each sees something different.
No layer is the verdict on its own. The framework reads them together, against each other and against the current production version, and treats the differences as evidence in their own right. When the layers disagree, the disagreement points at a blind spot worth investigating in a metric, in a model, or in our taxonomy.

| Layer | What it captures |
| --- | --- |
| 1. Programmatic detectors | Catastrophic generation failures. Empty output, format corruption, wrong-language drift. |
| 2. Lexical similarity | Surface-level change. Formatting regressions and mass rewordings against a reference. |
| 3. Semantic similarity | Meaning preservation across paraphrase and reordering. Blind to plausible inventions. |
| 4. Structured clinical entity extraction | Concept-level fidelity. Whether medications, allergies, diagnoses, findings present in the transcript appear in the note. |
| 5. Targeted judges | Pre-specified failure modes. Fabricated medications, missing safety netting, invented reasoning steps, internal contradictions. |
| 6. Domain judges | The long tail. Free-form issue extraction across accuracy, detail and personalisation domains. |
| 7. Human review | Ground truth for every layer above. Calibration of judge precision. |
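
One way to treat disagreement between layers as a finding is sketched below. It is a minimal illustration under stated assumptions: the layer names, the boolean pass/fail verdicts and the `divergent_notes` helper are all hypothetical, and real layers would return richer signals than a single flag.

```python
# Hypothetical per-layer verdicts for one note: True means the layer saw no problem.
LayerVerdicts = dict[str, bool]


def divergent_notes(results: dict[str, LayerVerdicts]) -> dict[str, list[str]]:
    """Return, per note, the layers that flagged it while other layers passed it."""
    flagged = {}
    for note_id, verdicts in results.items():
        failing = sorted(layer for layer, ok in verdicts.items() if not ok)
        # Disagreement: at least one layer fails and at least one passes.
        if failing and len(failing) < len(verdicts):
            flagged[note_id] = failing
    return flagged


results = {
    "note-1": {"lexical": True, "semantic": True, "entity": True, "targeted": True},
    # Reads well and preserves meaning, but a targeted judge flags an invention.
    "note-2": {"lexical": True, "semantic": True, "entity": True, "targeted": False},
    # Every layer agrees the note is broken: a failure, but not a divergence.
    "note-3": {"lexical": False, "semantic": False, "entity": False, "targeted": False},
}
review_queue = divergent_notes(results)
```

Here only `note-2` lands in the review queue, since it is the case where similarity layers pass a note that a targeted judge rejects.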
6 · Closing

Evaluation is not something we have done. It is something we keep doing.

Heidi supports clinical judgment. Evaluation is the discipline that lets us mean it when we say so.

7 Scribe evaluation layers. Lexical, semantic, structured, targeted, domain, programmatic and human, all running in parallel on every release.

15 Evidence dimensions. Tested across safety, quality and performance tiers, on every version of every model.

An iterative discipline.

Methods retune to clinical judgment. Test sets grow. The framework is itself under review.

5.1 · Fifteen dimensions, every model

Fifteen dimensions, three tiers, on every version of every model.

Evaluation is not a single accuracy score. By assessing 15 dimensions across an extensive and deep set of use cases, we can put the system under pressure and measure what comes back across safety, quality and performance. This is how we decide whether each version is fit to release.

| Tier | Dimension | What it measures |
| --- | --- | --- |
| Safety | Faithfulness | Whether claims are supported by content at the cited URLs. |
| Safety | Factfulness | Whether the medical content is objectively correct. |
| Safety | Harm Risk · Extent | Severity of potential physical harm if the response were followed. |
| Safety | Harm Risk · Likelihood | Probability a clinician would act on the response as written. |
| Safety | Security | Robustness to adversarial manipulation across attack categories. |
| Quality | Completeness | Whether the response addresses every part of the query. |
| Quality | Readability | Whether the response is structured for efficient clinician use. |
| Quality | Fully Cited | Whether every substantive claim carries an inline citation. |
| Quality | Source Authoritativeness | Whether cited sources are guideline-quality (NICE, AHA, Cochrane and similar). |
| Quality | No Unexplained Contradictions | Whether the response is internally consistent. |
| Quality | Structured Formatting | Whether the response uses visual hierarchy a clinician can scan. |
| Quality | Core Answer at Top | Whether the direct answer is presented before supporting reasoning. |
| Performance | Latency | Time to first token and time to completion. |
| Performance | Information Density | Unique clinical facts per token. |
| Performance | Response Consistency | Semantic similarity across rephrased versions of the same query. |
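
As a rough illustration of the two performance dimensions that are defined as ratios, the sketch below computes information density and response consistency. It rests on several assumptions: facts are already extracted by some upstream step (not shown), tokens are approximated by whitespace splitting, and Jaccard overlap of fact sets stands in for semantic similarity, which in practice would more likely use embeddings or a judge model. The drug names are placeholders.

```python
from itertools import combinations


def information_density(facts: set[str], response: str) -> float:
    """Unique facts per whitespace-delimited token (a crude token proxy)."""
    tokens = response.split()
    return len(facts) / len(tokens) if tokens else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two fact sets; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def response_consistency(fact_sets: list[set[str]]) -> float:
    """Mean pairwise overlap across answers to rephrasings of one query."""
    pairs = list(combinations(fact_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Facts extracted from answers to three rephrasings of the same question.
answers = [
    {"first-line: drug A", "review at 4 weeks"},
    {"first-line: drug A", "review at 4 weeks"},
    {"first-line: drug B", "review at 4 weeks"},  # materially different answer
]
consistency = response_consistency(answers)
density = information_density(
    answers[0], "Start drug A first line and review at 4 weeks"
)
```

The third answer disagrees on the first-line choice, which pulls the consistency score well below 1.0 even though two of the three answers match exactly.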
FAQ

Everything else you might want to know.

The short version of the questions clinicians and buyers ask us most.

Heidi is a multi-product AI care partner built around Scribe, Evidence, Comms, and Remote. It covers documentation, evidence-backed clinical answers, patient communications, and dedicated hardware for any care setting. It works across 200+ specialties in 110+ languages and connects with major EHR systems to support clinicians at every step of the clinical day.

2.3 · The principles we don't compromise on

Rigour on principles: An industry standard for clinical AI evaluation.

Sharper measurement is uncomfortable on the way up, as it surfaces issues that quieter methods miss. The unit of trust is not the headline number. It is the methodology behind it.

1. Transparency of methods

Every metric can be interrogated. We aim to publish what is measured, how it is measured, what it catches, and what it misses. The work is in the open.

2. Understanding of harm, founded in clinical knowledge

Errors are read in context. Severity, likelihood and clinical setting weigh into how an error is judged. Clinical judgment is in the loop, not at arm's length from it.

3. A defined lifecycle for every new evaluation method

Every new metric clears the same lifecycle before it's used to influence a release decision. Conceptualise, define, validate, stratify, deploy, externally validate.

4. Layered measurement, divergence as signal

Lexical, semantic, structured, targeted, holistic, human. No one metric catches every failure. Disagreement between the layers is a finding in itself.

5. Product-specific evaluation

Each product and underlying model is evaluated against the job it does. Transcription is not note generation. The framework follows the work.

This is how we deliver on the commitment behind every Heidi product.

  • AI built to clinical standards

  • AI that clinicians and patients trust

  • AI that gets measurably better with every release
1 · Why evaluation matters

Trust is not a feeling.

It is the answer to two questions, asked every release. Is it safe? Is it good?

Both questions need to be answered before a release reaches a clinician. They are not interchangeable, and they cannot be flattened into one number. Safety is a hard floor. Performance is what we optimise above it.

A hard floor.

Clinical safety is a constraint, not a target. A release must clear the safety gate before we ask whether it is good. Missed harm is still harm whether or not anyone reports it, so the signal has to be objective and exhaustive.

What good feels like.

A good note is accurate, well-structured, and fast. The bar moves with the clinician, the specialty, and the encounter, so performance is an objective to optimise, not a single score to hit.
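
The difference between a constraint and a target can be sketched in a few lines. The `Candidate` fields and the zero-findings threshold are hypothetical, and performance is collapsed to one scalar only to keep the example short; the text above argues that real performance is multi-dimensional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    safety_findings: int  # unresolved safety findings (hypothetical field)
    performance: float    # simplified scalar, used here only for illustration


def pick_release(candidates: list[Candidate]) -> Optional[Candidate]:
    """Apply safety as a constraint first, then optimise performance among survivors."""
    safe = [c for c in candidates if c.safety_findings == 0]
    # No safe candidate means no release, however good the scores look.
    return max(safe, key=lambda c: c.performance, default=None)


candidates = [
    Candidate("v1", safety_findings=0, performance=0.80),
    Candidate("v2", safety_findings=1, performance=0.95),  # best score, fails the gate
    Candidate("v3", safety_findings=0, performance=0.85),
]
chosen = pick_release(candidates)
```

The highest-scoring candidate is never considered because it does not clear the gate. A weighted sum of safety and performance would let a high score buy back a safety finding, which is exactly what a hard floor rules out.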

2.1 · The inference spectrum

Rigour on inference: Inference is the product and the failure mode.

Without inference, a scribe is a transcript. With inference, the same operation that turns "how long have you had this" into "2 weeks history of cough" can also fabricate a finding the clinician never elicited. Hallucination is the same operation, sitting further down the same continuum, and a robust framework must consider this spectrum.

1. Benign restatement

Patient said "chest tightness when I walk upstairs." Note reads "exertional chest tightness."

2. Useful clinical inference

Same patient. Note reads "exertional chest tightness, suggestive of cardiac aetiology." Useful if the reader applies judgment. Risky if it becomes a diagnostic anchor.

3. Diagnostic over-reach

Same patient. Note reads "exertional angina." The inference has crossed into a diagnostic claim the conversation did not support.

4. Fabricated finding

Nothing said about exertional symptoms. Note reads "denies exertional symptoms." A confident negative that was never elicited.

These are the kinds of issues a single metric will smooth over, but a layered framework is built to surface.

Hidden trade-offs.

A faster, denser release that quietly drops completeness, citation coverage, or safety margin.

Inferred fabrications.

A clinically plausible finding the conversation never supported, hidden inside a well-structured note.

Response inconsistency.

Two clinically identical questions producing materially different answers.

Embedded bias.

Race-based or gender-based reasoning that is debunked, but easy to repeat with confidence.

3.1 · When testing happens

Every release. Every version. Continuously.

Methods differ; the rhythm does not. The same cadence governs Scribe and Evidence.

Full test set before ship.

Every candidate version runs against the full test set before it can ship. Flagged outputs go to a clinician on our team.

Side by side with production.

Every candidate is run side by side against the current production version. Trade-offs are visible to the release decision, not hidden behind a single number.

Continuous, after release.

The pipeline runs continuously after release. Every version produces a comparable evaluation dataset, feeding into our post-market surveillance (PMS) plan under EU MDR Article 83.
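
A side-by-side comparison can be sketched as a per-dimension report rather than a single aggregate. The dimension names, values and the `higher_is_better` map below are illustrative assumptions, not measured results.

```python
def compare(
    candidate: dict[str, float],
    production: dict[str, float],
    higher_is_better: dict[str, bool],
) -> dict[str, str]:
    """Label each shared dimension as improved, regressed or unchanged."""
    report = {}
    for dim in sorted(candidate.keys() & production.keys()):
        delta = candidate[dim] - production[dim]
        if delta == 0:
            report[dim] = "unchanged"
        elif (delta > 0) == higher_is_better[dim]:
            report[dim] = "improved"
        else:
            report[dim] = "regressed"
    return report


# Hypothetical scores for the current production version and a candidate.
production = {"latency_s": 4.0, "completeness": 0.90, "fully_cited": 0.95}
candidate = {"latency_s": 3.0, "completeness": 0.85, "fully_cited": 0.95}
direction = {"latency_s": False, "completeness": True, "fully_cited": True}

report = compare(candidate, production, direction)
```

The candidate is faster but less complete. Keeping the dimensions separate puts that trade-off in front of the release decision, where an averaged score could have hidden it.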

4.1 · Two stages, one pipeline

A clinical AI scribe is two systems stacked.

Speech recognition turns audio into a transcript. A language model turns the transcript into a note. They have different failure modes, are measured against different references, and are evaluated separately before the end-to-end pipeline is evaluated as a whole.

Input: The Visit → Intermediate: Transcript → Output: Clinical Output

Stage 1

Transcription. Common evaluation methods include Word Error Rate (WER) against a reference transcript. This is limited, as the errors that matter for downstream use, such as a misheard medication, are not always the errors that move WER.

Stage 2

Note generation. Common evaluation methods include comparing against clinician edits or a reference note. This is limited, as a low number of edits can mean Heidi got it right, or that the clinician did not catch what Heidi got wrong.
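
The Stage 1 limitation is easy to reproduce. Below is a minimal WER sketch using word-level edit distance; the sentences and the placeholder names `drug_a` and `drug_b` are invented for illustration. A transcript with harmless wording changes scores worse than one with a single misheard medication.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


reference = "the patient takes drug_a ten milligrams once daily"
harmless = "patient takes drug_a ten milligrams once a day"
misheard = "the patient takes drug_b ten milligrams once daily"

wer_harmless = word_error_rate(reference, harmless)
wer_misheard = word_error_rate(reference, misheard)
```

On WER alone the misheard-medication transcript looks like the better one, which is why a transcription metric needs a clinically weighted companion.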

Our multi-layered approach means we can better measure and improve Heidi, built to clinical standards with trust at the core.

The bar for how clinical AI is evaluated rises faster when the work is held in the open.

Whether you're a clinician, enterprise buyer, or regulator, our doors are open.
