Invented medications. Repeated phrases. A family history attributed to the wrong person. A "history of hypertension" the patient never mentioned. Time references that drift by months. A "pain 6/10" the patient never said.
A clinic owner emailed me 14 of her practice's AI-generated notes last week and asked if I'd take a look. She was nervous about something but couldn't put a finger on what. I found six different categories of hallucination across nine of those 14 notes. She had already signed all nine.
AI scribes have genuinely transformed documentation. Clinicians are getting their evenings back, and the best pipelines produce notes that are more complete than hand-typed charts, in less time. The failure mode to understand isn't "AI writes notes"; it's "AI writes notes without being anchored to what was actually said." When a summarization model is asked to produce a SOAP note and given nothing but a rough prompt and training priors, it pattern-matches. When the same model is grounded, meaning it is forced to justify every sentence against a verbatim transcript of the consult, the failure modes below largely disappear.
This piece walks through the seven red flags I see most often in ungrounded pipelines, then explains how Oli Health's proprietary transcript-grounding approach closes the gap. If you're running any AI scribe, the audit protocol at the end is worth borrowing regardless of vendor.
Why an AI scribe hallucinates in the first place
The short version. An AI scribe is two models stitched together. A speech-to-text model transcribes the audio. A language model rewrites the transcript into a structured note. Both can fail. The language model fails most loudly — and it fails hardest when the note-generation step isn't tied back to the transcript.
The transcription layer hallucinates when the audio is bad. Long pauses, background noise, an HVAC unit, a barking dog, a TV in the waiting room. Anything that introduces low-signal segments. A 2024 academic study by Koenecke et al. of Whisper, a widely used open-source speech-to-text model, found that roughly 1% of audio segments contained entirely fabricated phrases. About 38% of those fabrications fell into what the authors called explicit-harm categories: perpetuating violence, asserting inaccurate associations between people, or implying false authority. None of those categories are clinical, but the underlying failure mode (fluent text generated from low-signal audio) is exactly what shows up in chart hallucinations. One percent sounds small. A 30-minute consult contains hundreds of audio segments.
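To see why one percent is not small, here is a back-of-the-envelope calculation. It rests on assumptions that are mine, not the study's: that a consult has 300 segments (one reading of "hundreds"), that each segment fails independently, and that the roughly 1% rate carries over to your audio. Treat the result as an illustration, not a measurement.

```python
def p_at_least_one_fabrication(n_segments: int, per_segment_rate: float = 0.01) -> float:
    """Chance that at least one of n segments is fabricated.

    Assumes segments fail independently at the same rate, which is a
    simplification: real failures cluster around noisy stretches of audio.
    """
    return 1.0 - (1.0 - per_segment_rate) ** n_segments


def expected_fabrications(n_segments: int, per_segment_rate: float = 0.01) -> float:
    """Average number of fabricated segments under the same assumptions."""
    return n_segments * per_segment_rate


if __name__ == "__main__":
    n = 300  # hypothetical segment count for a 30-minute consult
    print(f"expected fabricated segments: {expected_fabrications(n):.1f}")
    print(f"chance of at least one: {p_at_least_one_fabrication(n):.1%}")
```

Under those assumptions, a single consult would average about three fabricated segments, and the chance of a completely clean transcript would be roughly one in twenty.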
The language model hallucinates differently. It pattern-matches. If your patient is a 58-year-old presenting with chest discomfort, the model has been trained on enough cardiology notes that it has learned to expect "history of hypertension" near that profile. So it tends to write it, even when nobody said it. The model is not trying to deceive you. It's predicting the next plausible word. Sometimes the next plausible word is a lie.
The fix is structural: the summarization model should never be allowed to produce a sentence that isn't traceable to a specific span of the verbatim transcript. That's grounding. More on how Oli implements it below.
Why does my AI scribe hallucinate?
An AI scribe hallucinates when the underlying speech-to-text model encounters low-signal audio (background noise, long pauses, unclear speech) or when the summarization model pattern-matches your note onto its training data instead of the actual transcript. The cure is transcript grounding — forcing every generated sentence to map back to a specific verbatim span of the consult.
The seven red flags I keep finding in ungrounded pipelines
I've ranked these roughly in order of how often I see them, not how dangerous they are. The dangerous ones are at #3 and #6. Every one of them is a direct consequence of generating notes without anchoring them to the real transcript.
1. Phrase-loop repetition
"The patient reports mild discomfort" appears three times in the same note — once in the subjective, once in the assessment, once in the plan. Or "Tolerated treatment well" loops across the bottom of the chart like a watermark.
This is what happens when a low-signal segment of audio sends the speech-to-text model into a repetition trap, and the summarization model dutifully copies the loop into multiple sections. It is one of the oldest known failure modes in automatic speech recognition (ASR). The fix is mechanical: search every finished note for any 4-word sequence that appears more than once. Most notes shouldn't have any. If you find a loop, suspect the whole section it sits in.
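That search is easy to automate. The sketch below is one way to do it in Python; the function name `repeated_ngrams` and the tokenization rule (lowercase, letters, digits, and apostrophes only) are my own choices, not part of any scribe's API.

```python
import re
from collections import Counter


def repeated_ngrams(note: str, n: int = 4) -> dict[str, int]:
    """Return every n-word sequence that appears more than once in the note."""
    # Lowercase and drop punctuation so "Discomfort." and "discomfort" match.
    words = re.findall(r"[a-z0-9']+", note.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return {gram: count for gram, count in counts.items() if count > 1}


if __name__ == "__main__":
    note = (
        "Subjective: The patient reports mild discomfort. "
        "Assessment: The patient reports mild discomfort on exertion."
    )
    for gram, count in repeated_ngrams(note).items():
        print(f"{count}x  {gram}")
```

A hit is a prompt to reread the section, not proof of a hallucination: boilerplate phrases your template legitimately repeats will show up too.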
2. Invented medications
A drug name shows up in the note that the patient never said. Often it's a near-rhyme. "Metoprolol" for "metformin." "Zoloft" for something the patient was actually trying to remember.
Usually the audio briefly dropped, or a background sound contained a vowel pattern that resembled a drug name. The model carries thousands of drug names in its vocabulary and is statistically eager to surface them. Cross-check every medication in any AI-generated note against the patient's chart and intake form. If a drug isn't already in the patient's history and you can't recall it being said in the room, treat it as fabricated until you can confirm it.
3. Fabricated past diagnoses
This is the one that scares me most.
The note says "Patient has a history of hypertension." The patient does not have a history of hypertension. The audio never contained the word. The language model is pattern-matching the patient's demographic profile against typical chart patterns and inserting what it thinks belongs there.
This is the worst category because the lie sounds completely normal. A reviewer skimming for accuracy will read past it. The reader has to actively distrust the history section to catch this one. My rule: never trust a "history of" line in an AI-generated note unless you can point to where the patient said it. Spot-check against intake forms and prior visits. Always. (In a grounded pipeline, that check is automatic: every "history of" sentence carries a pointer back to the transcript line that supports it, and if no such line exists, the sentence never gets written.)
4. Pronoun drift
A note that opens with "she" switches to "he" two paragraphs in. Or a non-binary patient who uses "they/them" gets defaulted to "he" the entire way through.
The model carries a statistical bias toward "he" for clinical subjects, and when it loses track of which pronoun your patient uses, it drifts to its default. Cmd-F (or Ctrl-F) for "he", "she", and "they" before you sign. It takes ten seconds.
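If you would rather script the ten seconds, a rough version of that search looks like this. The pronoun groupings and the function names are my own illustration. One caveat worth stating loudly: a "he" or "they" in the note may legitimately refer to a family member or another clinician, so a flag here means "go read it", nothing more.

```python
import re
from collections import Counter

# Hypothetical groupings; extend them to match how your patients are addressed.
PRONOUN_GROUPS = {
    "he": {"he", "him", "his", "himself"},
    "she": {"she", "her", "hers", "herself"},
    "they": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def pronoun_counts(note: str) -> dict[str, int]:
    """Count how often each pronoun group appears in the note."""
    counts: Counter[str] = Counter()
    for word in re.findall(r"[a-z']+", note.lower()):
        for group, forms in PRONOUN_GROUPS.items():
            if word in forms:
                counts[group] += 1
    return dict(counts)


def has_pronoun_drift(note: str, expected: str) -> bool:
    """True if the note uses any pronoun group other than the expected one."""
    return any(group != expected for group in pronoun_counts(note))


if __name__ == "__main__":
    note = "She reports knee pain after running. He denies swelling."
    print(pronoun_counts(note))
    print(has_pronoun_drift(note, expected="she"))
```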
5. Time confusion
The patient said "two months ago." The note says "two weeks ago." "Since the surgery in 2022" becomes "since the surgery in 2002."
Numbers in audio are fragile. They get transcribed poorly under almost any condition, and the downstream model rarely catches the inconsistency. Highlight every time reference on first read of a chart.
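When you have the transcript alongside the note, part of that highlighting can be mechanical. The sketch below pulls out two simple kinds of time reference ("N units ago" and four-digit years) and lists those in the note that never appear in the transcript. The pattern is deliberately narrow and is my own assumption about what is worth catching; it compares exact strings, so "3 months ago" and "three months ago" will not match each other.

```python
import re

# Matches phrases like "two months ago" or "3 weeks ago", plus years 1900-2099.
TIME_REF = re.compile(
    r"\b(?:\d+|one|two|three|four|five|six|seven|eight|nine|ten|a|an|several|few)\s+"
    r"(?:day|week|month|year)s?\s+ago\b"
    r"|\b(?:19|20)\d{2}\b",
    re.IGNORECASE,
)


def time_references(text: str) -> list[str]:
    """Return every time reference found in the text, lowercased, in order."""
    return [match.group(0).lower() for match in TIME_REF.finditer(text)]


def unmatched_time_references(note: str, transcript: str) -> list[str]:
    """Time references in the note that never occur in the transcript."""
    heard = set(time_references(transcript))
    return [ref for ref in time_references(note) if ref not in heard]


if __name__ == "__main__":
    transcript = "It started two months ago, since the surgery in 2022."
    note = "Onset two weeks ago. Since the surgery in 2002."
    print(unmatched_time_references(note, transcript))
```

On the example from this section, both drifted references ("two weeks ago" and "2002") are flagged.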
6. Family history that ends up attributed to the patient
The patient was describing their mother's stroke at 62. The note says the patient had a stroke at 62.
This one is rare, but it's the other category that scares me, alongside #3. The model loses track of who's speaking about whom; pronouns and subjects are genuinely hard to keep straight when family members are being discussed in detail. Read the family history section twice. Slowly. Anything attributed to the patient that the patient didn't claim is the kind of error that ends up in a complaint to your regulator.
7. Fabricated measurements
"Pain 6/10." "BMI 31.2." "Range of motion 45 degrees." Numbers the patient never stated and you never measured.
The model fills in a structured field because it's been trained that clinical notes contain those fields. The number itself is fabricated whole.
I've stopped trusting any number in an AI-generated note unless I can trace it to either a verbal report from the patient or a measurement I took. If neither exists, the number doesn't either.
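That tracing rule can be partly scripted. This sketch lists every number in a note that appears neither in the transcript nor in a list of values you measured yourself. The names are hypothetical, and there is a real limitation: it only recognizes digits, so it assumes your transcript renders spoken numbers as "6/10" and not "six out of ten". If yours spells them out, expect false alarms.

```python
import re

# Integers, decimals, and simple ratios such as "6/10".
NUMBER = re.compile(r"\d+(?:\.\d+)?(?:/\d+)?")


def untraceable_numbers(note: str, transcript: str, measured: tuple = ()) -> list[str]:
    """Numbers in the note with no source in the transcript or in measured values."""
    known = set(NUMBER.findall(transcript)) | {str(value) for value in measured}
    return [number for number in NUMBER.findall(note) if number not in known]


if __name__ == "__main__":
    note = "Pain 6/10. BMI 31.2. Range of motion 45 degrees."
    transcript = "My knee bends to about 45 degrees I think."
    print(untraceable_numbers(note, transcript))
```

Here "45" traces to the transcript, while "6/10" and "31.2" have no source and are returned for review.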
How Oli Health's transcript grounding prevents these failures
Oli Health was built from the ground up around a single engineering conviction: a generated note should never contain a sentence that can't be justified by the verbatim transcript of the consult. That constraint is the spine of our proprietary grounding pipeline. Here's what that actually means in practice.
Every generated sentence carries a transcript anchor. When our summarization model writes "Patient reports intermittent chest discomfort for three weeks," it also emits a pointer to the specific span of the verbatim transcript where the patient said that. If the model tries to produce a sentence with no supporting span — a "history of hypertension" line pulled from training priors rather than the conversation — the sentence is rejected before it ever reaches the draft chart.
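To make the idea concrete, here is a toy version of an anchor gate. It is not Oli's implementation, which is proprietary; the `DraftSentence` type, the character-offset anchors, and `filter_grounded` are all assumptions made for illustration. It also checks only that an anchor exists and points at real text. Verifying that the anchored passage actually supports the sentence is the harder problem and is out of scope for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftSentence:
    text: str
    # Hypothetical anchor: (start, end) character offsets into the transcript.
    anchor: Optional[tuple[int, int]]


def filter_grounded(
    sentences: list[DraftSentence], transcript: str
) -> tuple[list[DraftSentence], list[DraftSentence]]:
    """Split sentences into (kept, rejected) based on whether the anchor is valid."""
    kept, rejected = [], []
    for sentence in sentences:
        if sentence.anchor is None:
            rejected.append(sentence)  # no supporting span at all
            continue
        start, end = sentence.anchor
        in_bounds = 0 <= start < end <= len(transcript)
        if in_bounds and transcript[start:end].strip():
            kept.append(sentence)
        else:
            rejected.append(sentence)  # span is empty or points outside the transcript
    return kept, rejected


if __name__ == "__main__":
    transcript = "I've had chest discomfort on and off for three weeks."
    draft = [
        DraftSentence(
            "Patient reports intermittent chest discomfort for three weeks.",
            (0, len(transcript)),
        ),
        DraftSentence("Patient has a history of hypertension.", None),
    ]
    kept, rejected = filter_grounded(draft, transcript)
    print([s.text for s in rejected])
```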
The verbatim transcript is preserved and audited, not thrown away. Many scribes discard the raw transcript the moment the structured note is generated. Oli keeps the verbatim transcription alongside the note, and the grounding layer audits each sentence against it at generation time. Clinicians can click any line in the draft and see the exact transcript passage that supports it. If the passage doesn't say what the note claims, you see that immediately.
Drug names, past diagnoses, and measurements get double-grounded. Beyond the transcript anchor, these high-risk categories are cross-referenced against the patient's existing chart and intake form. A medication that appears in the generated note but has no transcript span and no prior chart reference gets flagged and stripped before the draft reaches the clinician. That's why red flags #2, #3, and #7 show up far less often in Oli-generated charts than in ungrounded pipelines.
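The rule in that paragraph (strip a high-risk claim when it has neither a transcript span nor a prior chart reference) can be sketched in a few lines. Again, this is an illustration under assumptions and not Oli's code: the `Claim` type, the category labels, and the exact-match comparison against chart values are all hypothetical, and a real system would need proper drug-name and diagnosis normalization.

```python
from dataclasses import dataclass

# Categories the paragraph above names as high-risk.
HIGH_RISK = {"medication", "diagnosis", "measurement"}


@dataclass(frozen=True)
class Claim:
    category: str
    value: str
    has_transcript_span: bool


def review_claims(claims: list[Claim], chart_values: list[str]) -> dict[str, list[Claim]]:
    """Strip high-risk claims that have no transcript span and no chart reference."""
    chart = {value.lower() for value in chart_values}
    result: dict[str, list[Claim]] = {"keep": [], "strip": []}
    for claim in claims:
        in_chart = claim.value.lower() in chart
        unsupported = not claim.has_transcript_span and not in_chart
        if claim.category in HIGH_RISK and unsupported:
            result["strip"].append(claim)
        else:
            # Other categories are not covered by this particular check.
            result["keep"].append(claim)
    return result


if __name__ == "__main__":
    claims = [
        Claim("medication", "Metoprolol", has_transcript_span=False),
        Claim("medication", "Metformin", has_transcript_span=False),
        Claim("measurement", "6/10", has_transcript_span=True),
    ]
    outcome = review_claims(claims, chart_values=["metformin"])
    print([c.value for c in outcome["strip"]])
```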
Phrase-loop patterns are detected at the transcription layer, not the note layer. When the speech-to-text component starts emitting repetitive low-signal segments, we flag those segments as untrustworthy before they ever reach the summarizer. The summarizer then treats them as missing audio rather than as real content to paraphrase.
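A simplified picture of that step: scan the transcript segments, find runs where the same text repeats back to back, and replace the run with a "missing audio" marker before anything is summarized. The threshold of three consecutive repeats and the use of `None` as the marker are assumptions for this sketch, not a description of the production detector.

```python
from typing import Optional


def mask_looping_segments(
    segments: list[str], min_repeats: int = 3
) -> list[Optional[str]]:
    """Replace runs of identical consecutive segments with None (missing audio)."""
    # Normalize case and whitespace so trivially different repeats still match.
    normalized = [" ".join(segment.lower().split()) for segment in segments]
    masked: list[Optional[str]] = list(segments)
    i = 0
    while i < len(segments):
        j = i
        while j + 1 < len(segments) and normalized[j + 1] == normalized[i]:
            j += 1
        if j - i + 1 >= min_repeats:
            for k in range(i, j + 1):
                masked[k] = None  # treat the whole run as untrustworthy
        i = j + 1
    return masked


if __name__ == "__main__":
    segments = [
        "My knee hurts when I climb stairs.",
        "Thank you.",
        "Thank you.",
        "thank you.",
        "It started in March.",
    ]
    print(mask_looping_segments(segments))
```

Masking the whole run, including its first occurrence, is the conservative choice: once a loop starts, there is no reliable way to tell which copy, if any, reflects real speech.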
Every AI note stays in draft status until a clinician signs it. Until that moment it isn't visible in the AI patient overview, isn't available for billing, and isn't surfaced to the rest of the care team. The clinician remains the final failsafe — grounding makes their job easier, not optional.
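In data-model terms, draft enforcement is a small state machine: a note starts as a draft, only a clinician's signature moves it forward, and downstream consumers filter on that status. The sketch below shows the shape of the idea with hypothetical names (`Note`, `NoteStatus`, `billable_notes`); it is not drawn from Oli's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NoteStatus(Enum):
    DRAFT = "draft"
    SIGNED = "signed"


@dataclass
class Note:
    text: str
    status: NoteStatus = NoteStatus.DRAFT  # every AI note starts as a draft
    signed_by: Optional[str] = None

    def sign(self, clinician_id: str) -> None:
        """Only an identified clinician can move a note out of draft."""
        if not clinician_id:
            raise ValueError("a note can only be signed by an identified clinician")
        self.status = NoteStatus.SIGNED
        self.signed_by = clinician_id


def billable_notes(notes: list[Note]) -> list[Note]:
    """Downstream consumers such as billing only ever see signed notes."""
    return [note for note in notes if note.status is NoteStatus.SIGNED]


if __name__ == "__main__":
    draft = Note("Follow-up in two weeks.")
    signed = Note("Knee pain, conservative management.")
    signed.sign("clinician-001")
    print([note.text for note in billable_notes([draft, signed])])
```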
None of this makes hallucinations impossible. No pipeline can promise that. What grounding does is narrow the attack surface: most of the failure modes in the seven red flags above are structurally prevented, and the rest get flagged for review instead of sliding silently into a signed chart. We've written more about why this approach requires an AI-first EHR designed end-to-end rather than bolted onto a legacy product.
A 90-second audit for every chart before you sign
Regardless of which scribe you're using — including Oli — this is the protocol worth running on any AI-generated note before it leaves the draft stage. It's a hallucination check, not a deep clinical review. Five steps, in order.
The 90-second AI chart audit
This isn't a full clinical review. It's the floor — the bare minimum to catch what AI scribes get wrong most often. With a grounded pipeline like Oli's, most of these steps resolve in seconds because the anchors are one click away. With an ungrounded scribe, the audit is doing heavier work, and it's worth doing anyway.
How do I know if my AI scribe is hallucinating?
The strongest signals are notes that read smoother than the conversation actually felt, structured fields populated with specifics you don't remember discussing, and "history of" lines you can't trace back to the audio or to the patient's intake. If your scribe supports transcript grounding (as Oli does), you can click any line and see the verbatim passage that supports it. If it doesn't, the 90-second audit is your substitute.
What to ask your current vendor
If you're not using Oli, these are the questions worth putting to whichever scribe you are using:
- Does every generated sentence carry a pointer back to a specific transcript span, or is the note produced by the model in a single unconstrained pass?
- Is the verbatim transcript preserved after the structured note is generated, or is it discarded?
- Which of the seven failure modes above does your pipeline explicitly detect, and where in the pipeline does detection happen?
- Can the clinician see the transcript passage behind any claim in the draft chart?
If the answers are vague, that's the conversation worth escalating. "We use a state-of-the-art model" is not an answer to any of those questions.
FAQ
How often do AI scribes hallucinate?
It depends on audio quality, pipeline design, and whether the scribe grounds its output against the verbatim transcript. The Koenecke et al. study of Whisper, a widely used open-source speech-to-text model, found roughly 1% of audio segments contained fabricated phrases, with about 38% of those falling into explicit-harm categories. The paper's harms are not clinical, but the fabrication mechanism is the same one that shows up in AI-generated charts. A 30-minute appointment contains hundreds of segments. Grounded pipelines (like Oli's) catch and drop fabrications before the note finalizes; ungrounded ones rely on the clinician to notice.
What is transcript grounding?
Transcript grounding is the discipline of requiring every sentence in an AI-generated note to be justified by a specific span of the verbatim audio transcript. If the summarization model produces a claim that has no supporting transcript passage, the claim is rejected before it reaches the draft chart. Oli Health's proprietary grounding pipeline is built around this constraint, which is why the fabricated-diagnosis and invented-medication failure modes are substantially rarer in Oli-generated charts.
Is it safe to sign an AI-generated note without reviewing it?
No — even with a grounded pipeline. When you sign a chart you remain professionally and legally responsible for the accuracy of what it says, regardless of which tool produced the draft. The AI vendor is not a co-signer. Grounding narrows the set of errors you need to catch; it doesn't eliminate your role as the last line of defence.
Do expensive AI scribes hallucinate less than free ones?
Not reliably. Hallucination is a function of audio quality, model architecture, and pipeline guardrails — not list price. What matters is whether the scribe grounds its output against the transcript, detects phrase-loops at the transcription layer, and cross-checks drug names and diagnoses against the patient's chart. Ask any vendor, paid or free, which of those guardrails they actually implement.
Can I get in trouble with my regulator for an AI hallucination in a chart?
Potentially, yes. Provincial colleges and US state boards generally treat AI-generated content the same as any other content you sign. Your name on the chart means you attested to its accuracy, and "the AI did it" is not a defence recognized by any professional standard I've seen. Transcript grounding plus the 90-second audit is the cheapest possible insurance.
The seven red flags above aren't a reason to abandon AI scribes. They're a reason to demand that the scribe underneath your signature be grounded in what actually happened in the room. Oli Health's grounding pipeline was designed around that demand. Whatever scribe you use, the audit protocol is the floor — and the questions to your vendor are the ceiling.
If you're evaluating AI scribes, ask about transcript grounding specifically. We've documented how Oli's grounding pipeline handles draft enforcement, phrase-loop detection, and per-sentence transcript anchors — the engineering decisions are worth understanding even if you're comparing us against alternatives.

