All clinical examples in this article are fabricated composites. Nothing here describes a real patient.

Invented medications. Repeated phrases. A family history attributed to the wrong person. A "history of hypertension" the patient never mentioned. Time references that drift by months. A "pain 6/10" the patient never said.

A clinic owner emailed me 14 of her practice's AI-generated notes last week and asked if I'd take a look. She was nervous about something but couldn't put her finger on what. I found six different categories of hallucination across nine of those 14 notes. She had already signed all nine.

This is not a hit piece on any specific vendor. It's the same story across products. Jane's Charting AI Scribe, SimplePractice's Note Taker, Heidi, Nuance DAX, our own AI Notes at Oli — every ambient scribe on the market makes things up under the right conditions. What separates a safe practice from an unsafe one isn't the scribe. It's whether the clinician on the receiving end has a habit of catching it.

This is that habit.


Why an AI scribe hallucinates in the first place

The short version. An AI scribe is two models stitched together. A speech-to-text model transcribes the audio. A language model rewrites the transcript into a structured note. Both can fail. The language model fails most loudly.

The transcription layer hallucinates when the audio is bad. Long pauses, background noise, an HVAC unit, a barking dog, a TV in the waiting room. Anything that introduces low-signal segments. A 2024 academic study of Whisper, a widely-used open-source speech-to-text model, found that roughly 1% of audio segments contained entirely fabricated phrases. About 38% of those fabrications fell into what the authors called explicit-harm categories: perpetuating violence, asserting inaccurate associations between people, or implying false authority. None of those categories are clinical, but the underlying failure mode (fluent text generated from low-signal audio) is exactly what shows up in chart hallucinations. One percent sounds small. A 30-minute consult contains hundreds of audio segments; at 1%, that works out to a fabricated phrase or two in a typical visit.

The language model hallucinates differently. It pattern-matches. If your patient is a 58-year-old presenting with chest discomfort, the model has been trained on enough cardiology notes that it learns to expect "history of hypertension" near that profile. So it tends to write it, even when nobody said it. The model is not trying to deceive you. It's predicting the next plausible word. Sometimes the next plausible word is a lie.

Why does my AI scribe hallucinate?

An AI scribe hallucinates when the underlying speech-to-text model encounters low-signal audio (background noise, long pauses, unclear speech) or when the summarization model pattern-matches your note onto its training data instead of the actual transcript. Both layers can introduce content that was never spoken in the room.


The seven red flags I keep finding

I've ranked these roughly in order of how often I see them, not how dangerous they are. The dangerous ones are at #3 and #6.

1. Phrase-loop repetition

"The patient reports mild discomfort" appears three times in the same note — once in the subjective, once in the assessment, once in the plan. Or "Tolerated treatment well" loops across the bottom of the chart like a watermark.

This is what happens when a low-signal segment of audio sends the speech-to-text model into a repetition trap, and the summarization model dutifully copies the loop into multiple sections. It's one of the oldest known failure modes in speech recognition. The fix is mechanical: search every finished note for any 4-word sequence that appears more than once. Most notes shouldn't have any. If you find a loop, suspect the whole section it sits in.
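If you'd rather not eyeball that search, a few lines of script can run it for you. A minimal sketch in Python; the four-word window and the toy note below are illustrative assumptions, not any vendor's actual detection logic.

```python
import re
from collections import Counter

def repeated_phrases(note_text, window=4):
    """Return every `window`-word sequence that appears more than once in a note."""
    words = re.findall(r"[a-z0-9/]+", note_text.lower())
    ngrams = [" ".join(words[i:i + window]) for i in range(len(words) - window + 1)]
    return [phrase for phrase, count in Counter(ngrams).items() if count > 1]

# A looped sentence shows up as a cluster of overlapping repeated phrases.
note = (
    "Subjective: The patient reports mild discomfort. "
    "Assessment: The patient reports mild discomfort. "
    "Plan: Continue home exercises."
)
print(repeated_phrases(note))
# ['the patient reports mild', 'patient reports mild discomfort']
```

Any hit is a cue to reread the whole section it came from, not just delete the duplicate line.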

2. Invented medications

A drug name shows up in the note that the patient never said. Often it's a near-rhyme. "Metoprolol" for "metformin." "Zoloft" for something the patient was actually trying to remember.

Usually the audio briefly dropped, or a background sound contained a vowel pattern that resembled a drug name. The model carries thousands of drug names in its vocabulary and is statistically eager to surface them. Cross-check every medication in any AI-generated note against the patient's chart and intake form. If a drug isn't in either and you can't recall the patient mentioning it, treat it as fabricated until you've verified it.
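That cross-check is also scriptable if your notes and medication lists are available as plain text. A rough sketch, with a toy drug vocabulary standing in for a real formulary lookup; the function name and data here are assumptions for illustration only.

```python
def unverified_medications(note_text, charted_meds, known_drugs):
    """Drug names that appear in the note but not in the patient's existing chart."""
    note_lower = note_text.lower()
    mentioned = [drug for drug in known_drugs if drug.lower() in note_lower]
    charted = {med.lower() for med in charted_meds}
    return [drug for drug in mentioned if drug.lower() not in charted]

# Illustrative only: the chart lists metformin, the note suddenly says metoprolol.
flags = unverified_medications(
    note_text="Patient continues metoprolol 25 mg daily for blood sugar control.",
    charted_meds=["metformin"],
    known_drugs=["metformin", "metoprolol", "sertraline"],
)
print(flags)  # ['metoprolol'] -- verify with the patient before signing
```

A flag doesn't mean the drug is wrong. It means nobody has confirmed it yet.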

3. Fabricated past diagnoses

This is the one that scares me most.

The note says "Patient has a history of hypertension." The patient does not have a history of hypertension. The audio never contained the word. The language model is pattern-matching the patient's demographic profile against typical chart patterns and inserting what it thinks belongs there.

This is the worst category because the lie sounds completely normal. A reviewer skimming for accuracy will read past it. The reader has to actively distrust the history section to catch this one. My rule: never trust a "history of" line in an AI-generated note unless I can point to where the patient said it. Spot-check against intake forms and prior visits. Always.

4. Pronoun drift

A note that opens with "she" switches to "he" two paragraphs in. Or a non-binary patient who uses "they/them" gets defaulted to "he" the entire way through.

The model carries a statistical bias toward "he" for clinical subjects, and when it loses track of which pronoun your patient uses, it drifts to its default. Cmd-F (or Ctrl-F) for "he", "she", and "they" before you sign. It takes ten seconds.

5. Time confusion

The patient said "two months ago." The note says "two weeks ago." "Since the surgery in 2022" becomes "since the surgery in 2002."

Numbers in audio are fragile. They get transcribed poorly under almost any condition, and the downstream model rarely catches the inconsistency. Highlight every time reference on your first read of the chart.

6. Family history that ends up attributed to the patient

The patient was describing their mother's stroke at 62. The note says the patient had a stroke at 62.

This one is rare but it's the other category that scares me, alongside #3. The model loses track of who's speaking about whom — pronouns and subjects are genuinely hard to keep straight when family members are being discussed in detail. Read the family history section twice. Slowly. Anything attributed to the patient that the patient didn't claim is the kind of error that ends up in a complaint to your regulator.

7. Fabricated measurements

"Pain 6/10." "BMI 31.2." "Range of motion 45 degrees." Numbers the patient never stated and you never measured.

The model fills in a structured field because it's been trained that clinical notes contain those fields. The number itself is fabricated whole.

I've stopped trusting any number in an AI-generated note unless I can trace it to either a verbal report from the patient or a measurement I took. If neither exists, the number doesn't either.


"I caught my scribe putting 'history of GERD' on three patients in one afternoon. None of them have GERD. That was the day I started reading every chart line by line." — naturopath in Mississauga

A 90-second audit for every chart before you sign

This is the protocol I'd run on any AI-generated note before it leaves the draft stage. It is not a deep clinical review. It's a hallucination check. Five steps, in order.


The 90-second AI chart audit

Step 1 (15s): Search the note for any 4-word phrase that appears twice. If you find one, suspect the whole section.
Step 2 (15s): Cross-check every medication against the patient's record. If a drug is in the note but not the chart, it should not be in the note.
Step 3 (20s): Read the history section. Anything labeled "history of [condition]" — confirm it appears in the patient's intake or prior visit.
Step 4 (20s): Cmd-F the pronouns. Confirm consistency. Confirm match with the patient.
Step 5 (20s): Highlight every number and time reference. If you didn't measure it or hear the patient say it, delete it.

This isn't a full clinical review. It's the floor — the bare minimum to catch what AI scribes get wrong most often. After a few weeks it becomes muscle memory.
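Steps 4 and 5 lend themselves to the same kind of pre-check as the phrase-loop and medication searches above, so your eyes land on the right lines faster. A crude sketch of both, assuming plain-text notes; the regexes are deliberately simple and will miss some phrasings, so they point you at lines to verify rather than clear anything.

```python
import re
from collections import Counter

PRONOUNS = {"he", "him", "his", "she", "her", "hers", "they", "them", "their"}

def pronoun_counts(note_text):
    """Step 4: count pronouns so a mixed set jumps out before you sign."""
    words = re.findall(r"[a-z]+", note_text.lower())
    return Counter(word for word in words if word in PRONOUNS)

def numbers_and_time_refs(note_text):
    """Step 5: pull every number and rough time phrase for manual verification."""
    pattern = r"\b\d+(?:\.\d+)?(?:\s*/\s*10)?\b|\b(?:day|week|month|year)s?\s+ago\b"
    return re.findall(pattern, note_text, flags=re.IGNORECASE)

note = "She reports pain 6/10 since the surgery two months ago. He tolerated treatment well."
print(pronoun_counts(note))         # Counter({'she': 1, 'he': 1}) -- mixed pronouns, check which is your patient
print(numbers_and_time_refs(note))  # ['6/10', 'months ago'] -- trace each one back to the visit
```

Neither check replaces the read-through. They just tell you where to slow down.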

How do I know if my AI scribe is hallucinating?

The strongest signals are notes that read smoother than the conversation actually felt, structured fields populated with specifics you don't remember discussing, and "history of" lines you can't trace back to the audio or to the patient's intake. If you spot any of those, the note has probably hallucinated something. The 90-second audit above is built to surface them quickly.


What a safer pipeline actually looks like

Mentioning Oli here because it would be dishonest not to. The same engineering decisions apply to anyone building a scribe responsibly.

In our pipeline at Oli we detect the phrase-loop pattern in the transcription layer and flag it before it ever reaches the summarization stage. Every medication that surfaces in a generated note gets cross-checked against the patient's chart and intake form before the note finalizes. And every AI note stays in draft status until a clinician signs it. Until that moment it isn't visible in the AI patient overview, isn't available for billing, and isn't surfaced to the rest of the care team.

None of that makes hallucinations impossible. It catches some of the known failure modes early. The clinician signing the chart is still the failsafe. We've talked about this in why an AI-first EHR has to be designed end-to-end instead of bolted onto a legacy product.

If the scribe you're using won't tell you what its known failure modes are, or where in the pipeline detection happens, that's a conversation worth having with your vendor.


FAQ

How often do AI scribes hallucinate?

It depends on audio quality and the specific scribe. The Koenecke et al. study of Whisper, a widely-used open-source speech-to-text model, found roughly 1% of audio segments contained fabricated phrases, with about 38% of those falling into explicit-harm categories. The paper's harms are not clinical, but the fabrication mechanism is the same one that shows up in AI-generated charts. A 30-minute appointment contains hundreds of segments.

Is it safe to sign an AI-generated note without reviewing it?

No. When you sign a chart you remain professionally and legally responsible for the accuracy of what it says, regardless of which tool produced the draft. The AI vendor is not a co-signer. If a hallucinated medication or fabricated history appears in a chart you signed, the regulator and the insurer will look to you, not the software.

Do expensive AI scribes hallucinate less than free ones?

Not reliably. Hallucination is a function of audio quality, model architecture, and pipeline guardrails, not list price. Some free scribes have stronger phrase-loop detection than premium ones. Some premium scribes handle family-history attribution worse. Ask any vendor, paid or free, which specific failure modes they detect and how.

Can I get in trouble with my regulator for an AI hallucination in a chart?

Potentially, yes. Provincial colleges and US state boards generally treat AI-generated content the same as any other content you sign. Your name on the chart means you attested to its accuracy, and "the AI did it" is not a defense under any professional standard I've seen. The 90-second audit is the cheapest possible insurance.


The seven red flags above are not a vendor problem. They're a category problem. The clinicians managing this well aren't using a different scribe. They're running the audit. Every chart, every time. You deserve a scribe that surfaces its own failures instead of hiding them. Until the rest of the industry gets there, the audit is what stands between your patient's chart and a number nobody actually said.


If you're building chart-audit habits with your team, the same protocol applies regardless of which scribe you're running. We've documented how Oli handles draft enforcement and phrase-loop flagging — even if you're not running our product, the engineering decisions are worth borrowing.