
How Heidi Cut ASR Costs 64% and Latency 75% with NVIDIA Nemotron Open ASR

Heidi Team

16 March 2026 • 7 min read

Table of Contents

The Cost of Proprietary ASR

The Solution: Fine-Tuning Parakeet V2

Results: Clinical Grade, Production Scale

The Opportunity: Natural Voice Agents for Healthcare

The Future: Edge and Multilingual Expansion


Heidi processes 2.4 million clinical consultations weekly, with our AI Care Partner supporting 200+ medical specialties in 110 languages. With over 100 million global sessions to date, each representing 10-30 minutes of audio, our reliance on closed-source, vendor-hosted Automatic Speech Recognition (ASR) models had become a ceiling on our growth.

For clinicians, two things matter most: accuracy and speed. An error in a clinical note creates liability risk and erodes trust. Latency breaks the flow of care, forcing clinicians to wait when they should be focused on the patient in front of them.

The transition to a custom ASR stack powered by NVIDIA Nemotron ASR and the NVIDIA NeMo framework has been a pivotal moment for clinical AI. By moving beyond "off-the-shelf" APIs, we’ve achieved a level of performance and efficiency that was previously considered unattainable at this scale.

The Cost of Proprietary ASR

As Heidi scaled, we encountered three critical bottlenecks with general-purpose ASR providers:

  • Economic sustainability: Monthly transcription costs were projected to quadruple within the year, threatening our ability to make clinical AI accessible.
  • Clinical accuracy gaps: General models often struggled with specialized vocabulary. Errors like "met for men" instead of metformin aren't just typos - they impact the integrity of the clinical record.
  • Latency & friction: An end-to-end latency of ~3.0 seconds created a "wait state" in interactive clinical workflows that interrupted the physician-patient bond.

The Solution: Fine-Tuning Parakeet V2

We selected the open NVIDIA Parakeet V2 0.6B TDT model ("Parakeet V2") as our foundation. Its Token-and-Duration Transducer (TDT) architecture balances high-fidelity accuracy with the low-latency requirements of real-time medical scribing.

[Figure: Training pipeline with the NVIDIA NeMo framework]

To achieve clinical-grade accuracy at scale, we needed to move beyond general-purpose ASR models. The solution was fine-tuning NVIDIA Parakeet V2 specifically for medical conversations. This required two critical steps:

1. Precision data curation

Using the NVIDIA NeMo framework, we curated a dataset of 1,500 hours of clinical audio. We implemented an error-focused pipeline: rather than training on "easy" speech, we prioritized the "long tail" of medical terminology where base models typically fail.
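The error-focused idea can be sketched in a few lines. This is an illustrative sample selector under our own assumptions (Heidi's actual curation pipeline is not public): score each utterance by the base model's word error rate against the reference transcript, then keep the hardest examples for fine-tuning.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a reference transcript."""
    ref_t, hyp_t = reference.lower().split(), hypothesis.lower().split()
    return levenshtein(ref_t, hyp_t) / max(len(ref_t), 1)

def curate(samples, budget):
    """samples: (audio_path, reference, base_model_hypothesis) triples.
    Keep the `budget` utterances the base model gets most wrong --
    the 'long tail' where fine-tuning data is most valuable."""
    scored = [(wer(ref, hyp), path) for path, ref, hyp in samples]
    scored.sort(reverse=True)  # highest-error utterances first
    return [path for _, path in scored[:budget]]
```

Note how the "met for men" example from above scores: against the reference "start metformin 500 mg", the hypothesis "start met for men 500 mg" has WER 0.75, so that utterance would rise to the top of the curation queue.
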

2. High-throughput training

Leveraging 8x NVIDIA H100 Tensor Core GPUs, we fine-tuned the 0.6B parameter model in just 12 hours, completing 56,669 training steps across 96 total GPU hours. With each training step processing approximately 48 minutes of audio, we used NeMo's data-loading tools to stream tar shards, eliminating I/O bottlenecks and maximizing GPU utilization.
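Those numbers are mutually consistent: 56,669 steps at roughly 48 minutes of audio per step is about 45,000 hours of audio seen, i.e. around 30 passes over the 1,500-hour set, and 8 GPUs for 12 hours gives the 96 GPU hours. The I/O side is worth illustrating too. NeMo's own loaders add shuffling, bucketing, and multi-worker prefetch, but the core reason tar shards avoid I/O bottlenecks is that a tar file can be read front to back with no random seeks; a minimal pure-Python sketch of that streaming read:

```python
import tarfile

def stream_shard(shard_path):
    """Yield (name, bytes) pairs sequentially from one tar shard.

    Mode 'r|' opens the tar as a non-seekable stream, so the file is
    read strictly front to back -- sequential I/O keeps disk and
    network throughput high. (Illustrative only; NeMo's data-loading
    tools layer shuffling and prefetching on top of this pattern.)"""
    with tarfile.open(shard_path, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                f = tar.extractfile(member)
                if f is not None:
                    yield member.name, f.read()
```
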

Results: Clinical Grade, Production Scale

Quality: Outperforming the Baseline

In blind side-by-side evaluations, our fine-tuned Parakeet model achieved a 54.5% win rate against our previous production baseline.

Metric                   Heidi fine-tuned Parakeet   Base Parakeet V2   Legacy vendor solution
Non-curated WER          9.4%                        11.5%              13.0%
Medical terms F1 score   93.5%                       89.7%              89.5%
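For readers unfamiliar with the F1 metric above: it is the harmonic mean of precision and recall over recognized medical terms. A minimal sketch, assuming term lists have already been extracted from the transcripts (the dictionary Heidi uses for extraction is not public):

```python
from collections import Counter

def term_f1(reference_terms, hypothesis_terms):
    """Multiset F1 over medical terms: precision = fraction of
    hypothesized terms that are correct, recall = fraction of
    reference terms that were recognized."""
    ref, hyp = Counter(reference_terms), Counter(hypothesis_terms)
    tp = sum((ref & hyp).values())  # terms matched, with multiplicity
    if tp == 0:
        return 0.0
    precision = tp / sum(hyp.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, recognizing "metformin" but missing "HbA1c" gives precision 1.0, recall 0.5, F1 ≈ 0.67.
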


Latency: Real-Time Responsiveness

The transition to an in-house stack powered by NVIDIA shortened the feedback loop for clinicians significantly:

  • Legacy latency: ~3.0 seconds
  • Nemotron ASR latency: ~0.7 seconds (75%+ improvement)
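The quoted "75%+" follows directly from those two numbers:

```python
# Latency figures from the text: legacy ~3.0 s, Nemotron ASR ~0.7 s.
legacy_s, nemotron_s = 3.0, 0.7
improvement = (legacy_s - nemotron_s) / legacy_s  # fraction of latency removed
print(f"latency reduced by {improvement:.1%}")    # ~76.7%, hence "75%+"
```
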

Impact: Sovereign, Strategic, and Financial Control

  • 64% reduction in OpEx: By moving away from per-minute API pricing for English transcription, we decoupled our growth from the vendor tax.
  • Full model ownership: Owning our weights allows us to deploy across any region or private cloud, ensuring data sovereignty for our global clinical partners.

The Opportunity: Natural Voice Agents for Healthcare

Testing the new NVIDIA Nemotron 3 Speech-to-Speech (S2S) preview model opened our eyes to what's possible for voice in clinical workflows. Traditional systems pipe audio through three separate stages (ASR → LLM → TTS), creating latency and awkward turn-taking. Nemotron-VoiceChat collapses that pipeline into a single model with genuinely natural conversation and near-zero latency.

Why This Matters in Clinical Settings

Patient anxiety is highest during intake calls, scheduling, and pre-procedure consults. Legacy voice systems make it worse: robotic pacing, awkward pauses, inability to handle natural interruptions. Patients hang up. Staff spend hours managing the fallout.

Nemotron 3 S2S fixes the core problem:

  • Actually sounds human: Tone and pacing that make patients feel heard
  • Handles interruptions: Real conversational flow, not ping-pong delays
  • Open source: We control the model, the data stays local, no vendor lock-in

What This Could Mean for Clinical Front Desks

The potential isn't marginal. If implemented in systems like Heidi Comms, one of our newest products, the clinical front desk could shift from an administrative bottleneck to a driver of patient experience. Fewer dropped calls. Less admin burden. Empathetic support around the clock. Not because we hired more people, but because the tech actually works.

Clinical teams focus on care. Patients get help when they need it.

We're exploring how to integrate Nemotron-VoiceChat into our frameworks. Early testing has been promising, and we're keen to continue pushing what's possible with voice-native models in healthcare settings.

The Future: Edge and Multilingual Expansion

Our collaboration with NVIDIA is just beginning. Our roadmap includes:

  • On-device ASR: Bringing Nemotron ASR to mobile devices to enable offline, zero-latency transcription.
  • Global scale: Extending our fine-tuning methodology to all 110 supported languages.
  • Real-time learning: Implementing a "data flywheel" where edge-case failures are automatically flagged for the next training epoch.
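The flywheel's flagging logic can be sketched simply. This is our own illustration, not Heidi's production pipeline: field names and the confidence threshold are assumptions. Sessions where the model was unsure, or where a clinician hand-corrected the transcript, get queued for the next fine-tuning run.

```python
# Assumed cutoff on the model's mean recognition confidence, in [0, 1].
CONFIDENCE_FLOOR = 0.85

def flag_for_retraining(session):
    """session: dict with 'avg_confidence' (mean model score) and
    'clinician_edited' (True if the note was corrected by hand).
    Either signal marks the session as a likely edge-case failure."""
    return session["clinician_edited"] or session["avg_confidence"] < CONFIDENCE_FLOOR

def next_training_queue(sessions):
    """Collect session ids to feed into the next training epoch."""
    return [s["id"] for s in sessions if flag_for_retraining(s)]
```

In a real deployment the flagged audio would still pass through the same curation pipeline described earlier before entering a training epoch.
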

With NVIDIA's infrastructure and our clinical expertise, we're not just improving voice AI. We're setting the standard for what it should be in healthcare.
