Strategy Guide

Generative AI in healthcare: safe patterns for patient communication

Generative AI hallucinates, and in healthcare a confident wrong answer can harm a patient. The patterns that make patient communication safe, from the research.

By the Open Team | Updated June 18, 2026 | 8 min read

Generative AI makes things up. This is a property of how these models work, not a bug the next release will patch out, and in most settings it is tolerable. A chatbot that invents a slightly wrong fact about a software feature wastes a few minutes. The same model, talking to a patient about a medication, can do real harm. The open question for healthcare is how to use generative AI for patient communication so that its known failure mode cannot reach a clinical decision.

We think the safe path is narrower and more boring than the demos suggest, and that the teams who treat it that way get more out of the technology over time. Here is the case for designing around the failure rather than hoping it stays rare.

The failure is measured, and it is common

It is tempting to treat AI hallucination as an edge case that good prompting solves. The research does not support that comfort. A 2025 study of medical hallucinations in foundation models, with authors across several research institutions, surveyed clinicians and found that 91.8% had encountered a medical hallucination in practice and 84.7% believed those hallucinations were capable of causing patient harm. These are not lab curiosities. Working clinicians are seeing fabricated medical content from these systems and judging it dangerous.

Other work has found similar patterns. Researchers at the University of Massachusetts Amherst, working with the healthcare AI firm Mendel, reported hallucinations in almost all of the medical summaries generated by leading models such as GPT-4o and Llama-3. The errors were concentrated in exactly the high-stakes places: symptoms, diagnoses, and medication instructions. The pattern is consistent: the more clinical the content, the higher the cost of a confident error.

So the design problem is narrower than "reduce hallucinations to zero," which no one can promise. The job is to make sure that when the model is wrong, the wrong answer never lands on a patient as medical guidance.

Clinicians are already seeing medical hallucinations

Global clinician survey (n=70), Kim et al. 2025, “Medical Hallucinations in Foundation Models.”

91.8%

surveyed clinicians who had encountered a medical hallucination in practice

84.7%

surveyed clinicians who believed those hallucinations could cause patient harm

Based on a self-reported survey of 70 clinicians.

Where the errors concentrate

Medical-event inconsistencies across 50 summaries per model: UMass Amherst / Mendel, 2024 (GPT-4o, Llama-3).

327

medical-event inconsistencies in GPT-4o summaries

271

medical-event inconsistencies in Llama-3 summaries

Hallucinations clustered in symptoms, diagnoses, and medication instructions.

Safe pattern one: keep the AI in the operational lane

The single most protective decision is scope. Generative AI is genuinely good at the operational layer of patient communication: scheduling, directions, prep instructions, billing process, refill intake, "are my results ready." None of that requires clinical judgment, and a wrong answer there is recoverable.

The clinical layer (interpreting a symptom, explaining what a result means, advising on a dose) is where the failure is catastrophic and where the AI should not operate autonomously. Drawing that line explicitly, in the system design rather than in a disclaimer, is what keeps generative AI both useful and safe. A patient-facing agent that resolves logistics and refuses diagnosis is doing what it is good at and staying out of what it is dangerous at. That operational lane is exactly where conversational AI in healthcare earns its place, carrying the safe work to resolution and handing off the rest.
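To make the scope line concrete, here is a minimal Python sketch of a router that sends operational requests to the agent and everything else to a person. The keyword lists, the Lane names, and the route function are hypothetical illustrations, not Open.cx's implementation. A real deployment would use a trained, evaluated intent classifier. The design point is the default: anything clinical, and anything the router does not recognize, goes to a human.

```python
from enum import Enum


class Lane(Enum):
    OPERATIONAL = "operational"  # safe for the agent to resolve end to end
    CLINICAL = "clinical"        # needs a human clinician
    UNKNOWN = "unknown"          # not recognized, so treated as out of scope


# Hypothetical keyword lists for illustration only. A production system
# would use a trained, evaluated intent classifier instead.
OPERATIONAL_TERMS = {"appointment", "reschedule", "directions", "parking",
                     "bill", "invoice", "refill", "results ready", "prep"}
CLINICAL_TERMS = {"symptom", "pain", "dose", "dosage", "side effect",
                  "diagnosis", "what does my result mean", "should i take"}


def classify(message: str) -> Lane:
    text = message.lower()
    # Clinical signals are checked first, so a message that mixes
    # logistics with a clinical question is still treated as clinical.
    if any(term in text for term in CLINICAL_TERMS):
        return Lane.CLINICAL
    if any(term in text for term in OPERATIONAL_TERMS):
        return Lane.OPERATIONAL
    return Lane.UNKNOWN


def route(message: str) -> str:
    """Return "agent" for operational requests and "human" for everything else."""
    if classify(message) is Lane.OPERATIONAL:
        return "agent"
    # Clinical and unrecognized messages both default to a person.
    return "human"
```

The ordering inside classify is the policy. A patient who writes "can I reschedule, I've had chest pain since Tuesday" goes to a person, not to the scheduling flow.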

Safe pattern two: hand off when unsure, do not improvise

The second pattern is about behavior at the boundary. Models are happy to produce a fluent answer to almost anything, which means the dangerous moments are the ones where the model is confidently wrong about something it should have declined. The safe behavior is for the AI to recognize low confidence or out-of-scope questions and hand off to a human rather than generate a guess.

This is a deliberate design choice with a real tradeoff. An AI tuned to escalate when unsure will resolve fewer conversations end to end. That is the point. Open.cx built its Agent 5 model to be conservative this way, handing off to a person when confidence is low rather than reaching for a plausible answer, because in a patient context a clean escalation is a feature and a smooth guess is a liability. The teams that demand a high autonomous-resolution rate on clinical-adjacent traffic are optimizing for the wrong number.
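The escalation rule itself can be small. The sketch below assumes that each drafted reply arrives with a confidence score between 0 and 1 and an in-scope flag from some upstream check. How that score is produced (a verifier model, retrieval scores, or something else) sits outside the sketch. The 0.85 threshold is an arbitrary illustrative value, and none of this describes how Agent 5 is actually built. The sketch also computes the autonomous-resolution rate so the tradeoff is visible.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    confidence: float  # assumed 0.0-1.0 score from an upstream estimator
    in_scope: bool     # assumed result of an upstream scope check


@dataclass
class Decision:
    action: str   # "reply" or "handoff"
    payload: str  # text sent to the patient


# Illustrative value only; set high so borderline drafts go to a person.
CONFIDENCE_THRESHOLD = 0.85

HANDOFF_MESSAGE = ("I want to make sure you get the right answer, "
                   "so I'm connecting you with a member of our team.")


def decide(draft: Draft, threshold: float = CONFIDENCE_THRESHOLD) -> Decision:
    # Out-of-scope questions are handed off regardless of confidence.
    if not draft.in_scope:
        return Decision("handoff", HANDOFF_MESSAGE)
    # Low confidence: discard the draft rather than send a guess.
    if draft.confidence < threshold:
        return Decision("handoff", HANDOFF_MESSAGE)
    return Decision("reply", draft.text)


def autonomous_rate(drafts: list[Draft], threshold: float) -> float:
    """Share of drafts the agent would send without involving a human."""
    if not drafts:
        return 0.0
    replies = sum(1 for d in drafts if decide(d, threshold).action == "reply")
    return replies / len(drafts)
```

Lowering the threshold raises autonomous_rate. On clinical-adjacent traffic, that is the number this pattern argues against optimizing.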

Safe pattern three: ground answers in your own knowledge

A generative model left to answer from its training data is answering from the open internet, averaged. A model grounded in your specific policies, your clinic's instructions, and your formulary is answering from a controlled source. Retrieval-augmented generation (RAG), in which the model draws on your verified knowledge rather than its parametric memory, narrows the space in which it can invent.

Open.cx's Agent 5 ingests raw knowledge directly rather than relying on scripted question-and-answer pairs, which means the agent can be pointed at the provider's actual documentation and kept there. Grounding is not a cure for hallucination; the studies are clear that no method eliminates it. It does meaningfully reduce the surface area, though, because the model has the right answer in front of it instead of reconstructing one.
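Here is a toy Python version of the grounding idea. It retrieves the best-matching passage from a small verified knowledge base and answers only from that passage. If nothing matches, it returns nothing so the caller can hand off. The word-overlap scoring stands in for a real retrieval pipeline, which would typically use embedding search. The knowledge-base entries are invented, and the generation step is left out entirely. In a full RAG setup, the model would be instructed to answer only from the retrieved passages. Returning the source ID with each answer keeps every reply traceable to a document someone has verified.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Passage:
    source_id: str
    text: str


# Invented entries standing in for a provider's verified documentation.
KNOWLEDGE_BASE = [
    Passage("prep-colonoscopy",
            "Stop eating solid food 24 hours before your colonoscopy."),
    Passage("parking",
            "Free patient parking is available in Garage B on Elm Street."),
    Passage("billing-portal",
            "You can pay your bill online through the patient portal."),
]

STOPWORDS = {"a", "an", "the", "i", "my", "you", "your", "is", "do", "can",
             "how", "what", "where", "to", "in", "on", "for", "through"}


def tokens(text: str) -> set[str]:
    """Lowercase alphabetic words, minus a small stopword list."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def retrieve(question: str, kb: list[Passage],
             min_overlap: int = 2) -> Optional[Passage]:
    """Return the passage sharing the most words with the question, or None."""
    if not kb:
        return None
    q = tokens(question)
    best_score, best = max(((len(q & tokens(p.text)), p) for p in kb),
                           key=lambda pair: pair[0])
    # Below the overlap floor, nothing verified is close enough to answer from.
    return best if best_score >= min_overlap else None


def grounded_answer(question: str,
                    kb: Optional[list[Passage]] = None) -> Optional[str]:
    kb = KNOWLEDGE_BASE if kb is None else kb
    passage = retrieve(question, kb)
    if passage is None:
        return None  # the caller should hand off instead of generating
    # Extractive answer with its source, so every reply is traceable.
    return f"{passage.text} [source: {passage.source_id}]"
```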

Safe pattern four: minimize the PHI in the loop

Safety covers exposure as well as accuracy. Every piece of protected health information that flows through a generative system is a piece that can leak, persist where it should not, or end up in a training set. The HIPAA minimum necessary standard is a safety pattern as much as a compliance one: the less PHI the model handles, the smaller the blast radius of any failure.

Concretely, that means redacting identifiers before they reach the model or the logs. It also means confirming, through a business associate agreement, that patient data is never used to train shared models. These are the same controls that define a HIPAA-compliant AI chatbot for patient support. Open.cx, for its part, redacts sensitive data before it reaches the model or the logs. A generative system that sees only what it needs is safer in every dimension, including the ones that have nothing to do with hallucination.
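Redaction belongs in front of both the model call and the logger. The Python sketch below masks a few well-formatted identifiers using regular expressions. The patterns are hypothetical, including the MRN format. Regexes alone do not de-identify text under HIPAA. Names, street addresses, and identifiers written in free text need additional tooling, such as named-entity recognition, plus human review.

```python
import re

# Hypothetical patterns for a few well-formatted identifiers, applied in
# order. The MRN format is invented for illustration.
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("MRN", re.compile(r"\bMRN[:# ]*\d{6,10}\b", re.IGNORECASE)),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed label, e.g. [PHONE]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    # Redact before the text goes to the model and before it is logged.
    print(redact("Call (555) 123-4567 or email jane.doe@example.com."))
    # prints: Call [PHONE] or email [EMAIL].
```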

What this adds up to

These patterns share a philosophy: assume the model will fail, and build so the failure is contained. Scope it to safe work. Make it escalate when unsure. Ground it in verified knowledge. Minimize the PHI in the loop. None of these is a clever trick. They are the unglamorous discipline of deploying a probabilistic system in a setting where some outputs can hurt people.

The alternative philosophy is to deploy broadly and trust that hallucinations are rare enough. That is the philosophy that produces the headlines health systems do not want. The research says hallucinations are not rare. The safe move is to design as if every answer could be the wrong one, and to make sure the wrong one cannot reach a patient as advice.

The honest version of the pitch

Vendors will tell you their model does not hallucinate, or barely does. Treat that as a tell. The credible position is that hallucination is inherent, that the published research shows it is common in medical contexts, and that the engineering job is containment. A vendor who says that out loud is more trustworthy than one who promises perfection. Generative AI in healthcare is worth deploying. It is worth deploying carefully, in the narrow lane where its strengths are real and its failures are caught before they reach the patient.

Frequently Asked Questions