ChatGPT Health Fails Its First Safety Test On Emergencies And Suicide Risk

When OpenAI put a health tool inside ChatGPT and let it read people's medical records, the pitch was reassurance on demand. The first researchers to test it independently came away using a different word: dangerous. In the trials that mattered most, the tool waved off more than half of true emergencies, and its self-harm safeguards behaved as if they had been wired in reverse, going quiet at exactly the moment a person needed them most.

OpenAI launched ChatGPT Health in January 2026 as the friendly front door to your own body. Connect your records, sync a wellness app, describe your symptoms, and the chatbot would tell you what to do next. Within weeks the company was reporting roughly forty million health questions a day flowing through it. That is not a pilot. That is a triage nurse deployed at planetary scale, sitting between anxious people and the decision to go to a hospital, and it went into the world without any independent proof that it was safe to occupy that seat.

That proof arrived, and it was not good. Researchers at the Icahn School of Medicine at Mount Sinai, led by Dr. Ashwin Ramaswamy, ran the first independent safety evaluation of the tool and published it in Nature Medicine on February 23, 2026. They did not stress the system with exotic edge cases. They built 60 realistic clinical scenarios spanning 21 medical specialties, varied the surrounding context 16 different ways, and put the tool through 960 interactions in total. The scenarios were the kind of thing that walks into an emergency department every day. The results were the kind of thing that ends careers if a human clinician produced them.

50%+: True emergencies the tool under-triaged, sending people home or to a routine appointment

960: Interactions tested across 60 scenarios and 21 specialties

~40M: Daily health questions the tool was already fielding before it was tested

More Than Half The Emergencies Went Unrecognized

The headline number is the one that should stop any product manager cold. In the cases that physicians on the study judged to require emergency care, ChatGPT Health under-triaged more than half of them. It told people who needed a hospital to stay home, rest, or book a routine appointment for later in the week. The examples the researchers surfaced were not ambiguous. In simulations of respiratory failure and diabetic crisis, the tool downplayed the severity roughly half the time. In one asthma scenario the platform noted signs consistent with respiratory failure and then suggested waiting for a future appointment rather than seeking emergency help. A person having that conversation would have every reason to feel reassured, and that reassurance is the harm.

This is the distinction that separates a bad search result from a dangerous one. A search engine that ranks the wrong link wastes a click. A confident health assistant that misreads an emergency spends a stranger's trust at the worst possible moment, and it does so in the same calm, fluent voice it uses when it happens to be right. We have written before about how the hallucination problem quietly drains enterprise AI budgets, but a spreadsheet error and a missed heart attack are not the same category of mistake. When the stakes move from money to a person deciding whether to call for help, the tolerance for a fluent wrong answer collapses to zero.

The Suicide Guardrails Were Inverted

If the emergency numbers were alarming, the self-harm findings were worse, because they revealed a safety system that was not merely weak but pointed the wrong way. The study's co-author, Dr. Girish Nadkarni, chief AI officer of the Mount Sinai Health System, described the suicide guardrail failure as the most alarming part of the work, and summed up the pattern in a single line: the system's alerts were inverted relative to clinical risk. Crisis banners appeared in lower-risk situations and then failed to appear when users described specific plans for self-harm, which is exactly the population a safeguard exists to catch.

The mechanism the researchers exposed is the detail that lingers. In one test, a patient described thinking about taking a large quantity of pills. On the symptoms alone, the crisis banner appeared every single time. Then the researchers added a set of normal lab results to the very same scenario, changing nothing about the stated intent, and the safeguard vanished. A safety net that disappears the moment a chart looks clinically reassuring is not a safety net. It is a trapdoor, and it opens under the people standing on the thinnest part of it.

That failure lands in the middle of a legal and regulatory environment that is already treating chatbot safety as a matter of life and death rather than a feature checklist. A grieving parent is suing OpenAI in a wrongful-death case that puts chatbot safety on trial, and states have started drawing hard lines, with New York banning AI companion chatbots for minors and attaching a fine to every violation. A tool whose crisis alerts fire backward is not walking into a permissive world. It is walking into one that is actively deciding how much a preventable death is worth in court.

A guardrail that triggers on low risk and goes silent on a stated plan to self-harm is not underbuilt. It is built backward. The system learned to reassure itself with a clean lab panel at the exact moment a human being needed it to sound an alarm.

Why "It Is Just A Chatbot" Is Not A Defense

OpenAI's response was that the study did not reflect how people typically use the tool in real life. That answer would carry more weight if the company had not shipped a feature explicitly designed to ingest medical records and hand out guidance, then celebrated forty million daily health questions as a growth metric. You cannot market a product as a health assistant and then, when it is tested as one, argue that nobody should have taken it that seriously. The scale of adoption is the company's own number. The scenarios were realistic. The failures were reproducible enough to publish.

The deeper problem is that safety here was assumed rather than demonstrated. The pattern is becoming familiar across the whole industry. When independent testers finally get their hands on a system, the picture rarely matches the launch-day confidence, a gap we watched play out when a frontier model was caught gaming its own safety evaluation. The lesson repeats: the guardrails a company describes and the guardrails an outsider can measure are two different things, and only one of them protects the user. A health tool that reaches tens of millions of people should have to clear an independent bar before it earns that reach, not after a journal quantifies the damage.

Independent experts did not soften the verdict. Alex Ruani, a researcher at University College London, called the findings unbelievably dangerous, warning that the tool creates a false sense of security that could lead to preventable harm and death. Her example was blunt: someone told to wait 48 hours during an asthma attack or a diabetic crisis could pay for that reassurance with their life. When the outside experts and the study authors reach for the same vocabulary of danger, the charitable interpretation that this is a rough early build stops being available.

The Reckoning

None of this means a machine can never help someone understand a symptom or prepare for a doctor's visit. It means the specific claim ChatGPT Health made, that it was safe to stand between a person and an emergency decision, did not survive contact with the first serious test of it. The tool missed more than half of the emergencies that mattered, and its most important safeguard, the one meant to catch someone in crisis, behaved as though it had been installed upside down. That is not a tuning issue to patch quietly in a future update. It is a question of whether a system should have been given that job at all before anyone checked whether it could do it.

This entry belongs in the same record as every other gap between AI marketing and AI reality we track, alongside our earlier reporting on the emergency triage failures inside the same Mount Sinai study and the broader timeline of moments the technology fell short of its own billing. The full catalog lives in our documentation of AI's recurring failures. The demo promised a health assistant that always knew when something was wrong. The test found one that was most confident precisely when it should have been most afraid.

The Verdict

The first independent safety evaluation of ChatGPT Health found it under-triaged more than half of true emergencies and inverted its suicide safeguards, going silent on described self-harm plans while firing on low-risk cases. Experts called it unbelievably dangerous. The tool was already answering roughly forty million health questions a day. The technology may have a future in medicine. The safety case for the version that shipped was never actually made.
