The most dangerous failures are the ones that look like success most of the time. ChatGPT Health, the dedicated medical assistant that OpenAI launched in January 2026, handles a textbook stroke or a severe allergic reaction the way you would hope: it recognizes the danger and tells the user to seek emergency care. That competence is exactly what makes the rest of the findings so unsettling. When researchers at the Icahn School of Medicine at Mount Sinai pushed past the obvious cases and tested the situations where danger hides, the tool that seemed so reliable started missing emergencies more often than it caught them. The study, led by Dr. Ashwin Ramaswamy and published February 23, 2026 in Nature Medicine, is one of the most rigorous independent audits a consumer medical chatbot has faced, and the picture it paints is of a system that is confident, fluent, and wrong in precisely the moments that matter most.

This is not a story about a chatbot being a little imperfect. It is a story about a tool that 40 million people were leaning on every single day, within weeks of launch, to help them decide whether a symptom was nothing or an emergency. When a system at that scale gets the triage call wrong on more than half of the cases that need a hospital, the failure is not an edge case. It is the core function breaking under the exact pressure it was built to handle.

How The Study Was Built To Be Hard To Dismiss

The reason this audit lands harder than the usual screenshot-and-outrage cycle is the methodology. The Mount Sinai team did not throw a handful of trick questions at the model and call it a day. They built 60 structured clinical scenarios spanning 21 medical specialties, then ran each one under 16 different contextual conditions, varying details like race, gender, social dynamics, and barriers to care. That design produced 960 separate interactions with ChatGPT Health, and the correct urgency level for each scenario was not decided by the researchers' gut. Three independent physicians set the right answer using guidelines drawn from 56 medical societies. In other words, the benchmark the chatbot was graded against is the same standard a real clinician is held to.

That structure matters because it forecloses the easy rebuttals. You cannot wave this away as cherry-picked prompts when the scenarios cover 21 specialties. You cannot call the grading subjective when three physicians and 56 medical societies set the key. And you cannot argue the model just needs more context when the failures got worse precisely as the scenarios moved away from textbook presentations and toward the nuanced, real-world situations where the danger is not stamped on the surface. The harder the case looked like an actual patient, the worse ChatGPT Health did.

960Clinical interactions tested
21Medical specialties covered
40MDaily users at the time

Under-Triaging More Than Half Of The Real Emergencies

The headline number is the one that should stop you cold: in cases that genuinely required emergency care, ChatGPT Health under-triaged more than half of them. Under-triage is the specific, deadly direction of error here. It does not mean the tool was overly cautious and sent worried-but-fine users to the ER. It means the opposite. It means that for the majority of the truly urgent scenarios, the system told the user the situation was less serious than it actually was. A patient relying on that read would be reassured at exactly the moment they needed to be alarmed, sitting at home with a condition that the model had quietly downgraded.

The model did well on the emergencies that announce themselves, the strokes and the anaphylaxis, the cases where the textbook description and the danger are the same thing. It fell apart on the nuanced situations where the threat is not immediately obvious, which is, of course, where human clinical judgment earns its keep and where a frightened person at home is least equipped to overrule a confident machine. A triage tool that only works when the emergency is already obvious is not a triage tool. It is a mirror that reflects the danger back to you only once you can already see it yourself.

The Suicide Alerts Were Inverted

If the triage finding is alarming, the crisis-alert finding is worse, because of the specific way it failed. The study found that ChatGPT Health's suicide crisis alerts appeared inconsistently, and not randomly inconsistently. They were inverted relative to risk. The alerts triggered inappropriately in lower-risk scenarios, and then failed to appear when users described specific self-harm plans. The single most dangerous input a mental-health-adjacent system can receive, a person stating concretely how they intend to hurt themselves, is the input most likely to slip past the safeguard.

"The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves." Dr. Girish N. Nadkarni, on the Mount Sinai ChatGPT Health findings

Sit with what that inversion means in practice. A user expressing vague, lower-acuity distress gets a crisis banner and a hotline number, which is appropriate enough on its own. But a user who crosses the most serious line, who states a plan, gets fluent, calm, helpful-sounding text with no alert at all. The safeguard is most likely to be absent exactly when its absence is most catastrophic. This is not a guardrail with a few gaps. It is a guardrail wired backward, and it was live in front of 40 million daily users.

Why This Is The Pattern, Not The Exception

Regular readers of this site will recognize the shape of this failure, because it is the same shape that surfaces everywhere AI gets deployed faster than its reliability can be verified. The pattern is fluency mistaken for competence. A model that produces confident, well-formatted, authoritative-sounding output gets trusted in proportion to how convincing it sounds, not in proportion to how often it is right, and the gap between those two things is where people get hurt. It is the same gap that lets unverified AI output quietly kill enterprise projects in budget review, except here the budget line is a human being deciding whether to go to the hospital.

And the consequences are not theoretical. This site has already documented a wrongful-death lawsuit brought by a mother who says an OpenAI chatbot played a role in her daughter's suicide. The Mount Sinai study is the clinical, peer-reviewed, methodologically armored version of the same warning: when a system that fluently discusses health and crisis is wired so that its protective alerts fire backward, the harm is not a hypothetical that lawyers will argue over years from now. It is a measurable property of the tool, sitting in a Nature Medicine paper, describing software that tens of millions of people opened today.

The defense will be that ChatGPT Health is not a doctor and never claimed to be, that the terms of service say to seek professional care, that the disclaimers were all in place. All of that is true and none of it matters at the scale of 40 million daily users, because a disclaimer does not change behavior the way a confident, specific, reassuring answer does. People do not read a chatbot's legal footer. They read its answer, and when that answer under-triages a real emergency or stays silent on a stated suicide plan, the disclaimer is a paper shield against a documented, repeatable harm.

The Verdict

ChatGPT Health did not fail a pop quiz. It failed a 960-interaction audit built by physicians against the standards of 56 medical societies, published in Nature Medicine, under-triaging more than half of true emergencies and inverting the suicide crisis alerts that are supposed to be its last line of defense. A tool that confidently downgrades real emergencies and goes quiet on stated self-harm plans, in front of 40 million daily users, is not a medical assistant. It is the fluency-over-competence failure at its most dangerous scale.

Did an AI health tool give you or someone you know dangerously wrong medical guidance? Tell us what happened.