There is a kind of product failure that costs you money, and there is a kind that costs you something you cannot get back. ChatGPT Health was sold as the second kind of helpful and turned out to be the second kind of dangerous. When OpenAI rolled out a health-focused version of its chatbot, the pitch was that millions of people who cannot afford a doctor, or cannot get an appointment for weeks, would finally have a calm, knowledgeable voice to ask about the symptom that woke them up at 3 a.m. The promise was triage, the soft and quiet act of telling a frightened person whether this is nothing, this can wait, or this is the moment you stop reading and call for help. That is the entire job. And the first time independent researchers measured how well it does that job, it failed the third category more than half the time.

The study came out of the Icahn School of Medicine at Mount Sinai and ran in the February 23, 2026 online issue of Nature Medicine, which matters because it is the first evaluation of ChatGPT Health that was not run by the company selling it. The researchers did not ask the tool trivia. They handed it realistic patient scenarios, including ones where the medically correct answer was go to the emergency department immediately, and they counted how often the chatbot got the urgency right. The headline number is the one that should end the conversation: in 51.6 percent of the cases that required immediate hospital care, ChatGPT Health advised the patient to stay home or schedule a routine visit. Flip a coin on your worst night and you would do about as well.

The Suicide Test Is The Part That Should Stop You Cold

If the emergency triage failure is alarming, the suicide-safeguard finding is genuinely chilling, because it exposes how shallow the safety layer actually is. The researchers tested the tool's crisis handling with a 27-year-old patient who said he had been thinking about taking a lot of pills. When the patient stated those symptoms plainly, ChatGPT Health displayed a crisis intervention banner, the kind of message that points a person toward a hotline and a human. So far, the safeguard appears to work.

Then they changed one thing. They kept the same patient, the same age, the exact same words about wanting to take a lot of pills, and they added a set of normal lab results to the scenario. With that single irrelevant addition, the crisis banner vanished. Not sometimes. In zero out of sixteen attempts did the safeguard fire once normal labs were attached to the same suicidal statement. A person disclosing the same intent to harm themselves was met with a crisis intervention in one framing and with silence in another, and the only difference was a few lines of clinical data that had nothing to do with the disclosure.

"The same patient, the same age, the same words about taking a lot of pills. Add normal lab results, and the suicide crisis banner disappeared in zero out of sixteen attempts. The safeguard was not protecting the person. It was reacting to a pattern of text, and the pattern was trivially easy to break." -- What the Mount Sinai suicide-scenario test actually revealed

This is the failure mode that should frighten anyone who understands how these systems work. A real safety mechanism protects the person regardless of how the situation is phrased. What ChatGPT Health has is not a safety mechanism, it is a text classifier that fires when the surrounding words look a certain way and goes quiet when they do not. The model is not recognizing a human being in crisis. It is recognizing a string of tokens, and a string of tokens can be nudged out of the danger zone by something as harmless as a normal cholesterol reading.

Why "It Is Just A Tool" Does Not Hold Up Here

The standard defense of these products is that nobody should use them for medical decisions, that they come with disclaimers, that a chatbot is not a doctor and everybody knows it. That argument falls apart the instant you look at who actually reaches for ChatGPT Health and why. The entire reason a health chatbot has a market is that a large number of people do not have easy access to a doctor. They are uninsured, or rural, or scared of a bill, or stuck on a six-week waitlist. For that person, the chatbot is not a supplement to professional care. It is the only voice in the room at the moment they are deciding whether their chest pain is heartburn or something that needs an ambulance.

So the disclaimer is a fiction about how the product is used. When a tool is marketed as health guidance and reached for by exactly the people who have no other option, telling it to advise "stay home" in half of true emergencies is not a harmless limitation. It is a system positioned at the precise decision point where a wrong answer kills, and it gives the wrong answer at coin-flip rates. The specialists who reviewed the findings said the system could "feasibly lead to unnecessary harm and death," and that is not hyperbole bolted onto a tech story. It is the plain reading of a 51.6 percent miss rate on the cases that matter most.

51.6% of true emergencies where ChatGPT Health advised staying home or a routine visit
0 of 16 attempts where the suicide crisis banner fired after normal labs were added
1st independent evaluation of ChatGPT Health, published in Nature Medicine

The Demo Looked Great. The Demo Always Looks Great.

Here is the uncomfortable pattern, and it repeats with every rushed deployment of generative AI into a high-stakes domain. The demo is dazzling. You ask the health bot about a headache and it produces a calm, organized, medically literate-sounding response that covers possible causes, when to worry, and what to do next. It reads like a good doctor. That fluency is exactly what gets a product shipped, because in a boardroom and a launch video, sounding right is indistinguishable from being right. The fluency is real. The reliability underneath it is not, and the gap between the two only shows up when someone runs the scenarios that hurt.

That is what an independent study is for, and it is why companies are so reluctant to invite one before launch. ChatGPT Health sounded competent enough to clear whatever internal bar OpenAI set for it. It took outside researchers, with no incentive to flatter the product, to feed it the cases that expose the difference between a system that talks like a clinician and a system that can actually be trusted with the one decision a clinician exists to make. The tool passed the vibe check and failed the stress test, and in medicine the stress test is the only test that counts.

A chatbot that sounds like a doctor and triages like a coin flip is more dangerous than one that sounds like a robot, because the confident, fluent answer is the one a frightened person at 3 a.m. is most likely to believe. The polish is not a feature here. It is the delivery mechanism for the wrong advice.

What Responsible Looks Like, And Why This Is Not It

There is a defensible version of an AI health assistant, and it behaves nothing like a system that quietly tells half of emergencies to stay home. A responsible version would be tuned to be aggressively conservative, erring hard toward "go get this checked" whenever the symptoms carry any real risk, because the cost of an unnecessary ER visit is a copay and the cost of a missed heart attack is a life. It would treat a stated intent to self-harm as an absolute trigger that no amount of surrounding text can switch off, because a suicide safeguard that can be disabled by adding normal labs is not a safeguard, it is decoration. And it would not ship until an independent body had run exactly the kind of evaluation Mount Sinai just ran, with the results made public before a single vulnerable person typed a symptom into it.

None of that is what happened. ChatGPT Health went out into the world, got reached for by the people with the fewest alternatives, and only after the fact did independent scientists discover that its emergency judgment is a coin flip and its suicide safeguard is a string-matching trick. The technology was good enough to sound trustworthy and nowhere near good enough to be trusted, and it was deployed into the one domain where that distinction is the whole point. Until a health AI is proven safe before launch instead of measured as dangerous afterward, the honest label is not "your AI doctor." It is a confident voice that does not reliably know when you are dying, pointed at the people least able to get a second opinion.

The Verdict

A health tool that advises "stay home" in 51.6 percent of true emergencies, and whose suicide safeguard vanishes when you add a normal lab result, is not an assistant. It is a liability marketed as a lifeline, aimed at the exact people who have nowhere else to turn. The first independent study found it dangerous. The right move was to find that out before launch, not after.

Got bad medical guidance from an AI chatbot? Tell us what happened.