The Study That Should Terrify Every ChatGPT Health User
OpenAI launched ChatGPT Health on January 7, 2026, positioning it as a medical companion for the roughly 40 million daily users already asking the chatbot health questions. The product ingests medical records, connects to Apple Health and MyFitnessPal, and promises to help users understand their symptoms. A small disclaimer at the bottom notes that it is "not intended for diagnosis or treatment." But those 40 million people are using it to make decisions about whether they need emergency care, and a team of Mount Sinai researchers just demonstrated that the system cannot be trusted to get the most critical decisions right.
The study, published in Nature Medicine in late February 2026, was led by Dr. Ashwin Ramaswamy with senior author Dr. Girish Nadkarni. The research team put ChatGPT Health through 960 simulated clinical interactions spanning 60 clinical scenarios across 21 medical specialties, benchmarked against 56 established medical guidelines. The results paint a picture of a system that handles the easy stuff competently and fails when the stakes are highest.
For routine medical cases, ChatGPT Health performed well, reaching 93% agreement with established medical guidelines. That number will almost certainly appear in OpenAI's marketing materials. What probably won't appear in those materials is everything else the study found.
Half of All Real Emergencies Sent Home to Wait
When patients presented with genuine medical emergencies, ChatGPT Health under-triaged 52% of them: more than half of the people who needed to go to the emergency room immediately were instead directed to seek evaluation within 24 to 48 hours. The downgraded conditions included diabetic ketoacidosis and impending respiratory failure, cases where a 24-hour delay can mean the difference between treatment and death.
The asthma scenario is particularly chilling because it reveals how the system's reasoning can directly contradict its own recommendations. In multiple respiratory distress simulations, ChatGPT Health correctly identified the warning signs of respiratory failure within its own explanatory text. It wrote out, in plain language, why the patient's symptoms were dangerous. And then, in the same response, it advised the patient to wait and seek non-emergency evaluation. The system understood the danger in its own analysis and then told the patient to ignore it.
What Under-Triage Looks Like in Practice
A patient describes symptoms consistent with diabetic ketoacidosis: excessive thirst, nausea, abdominal pain, fruity breath odor, rapid breathing. This is a life-threatening condition that requires immediate IV fluids, insulin, and electrolyte monitoring. Without treatment, it progresses to coma and death within hours.
ChatGPT Health's recommendation: seek evaluation in 24 to 48 hours.
On the other end of the spectrum, the system also over-triaged 35% of non-urgent cases, flagging them as requiring medical attention they did not need. While over-triage is less immediately dangerous than under-triage, it drives unnecessary emergency room visits, overwhelms healthcare systems, and generates anxiety in patients who are perfectly fine. A medical triage system that is simultaneously too cautious with the healthy and too dismissive of the critically ill has fundamentally failed at its core function.
Suicide Prevention Alerts That Work Backwards
If the emergency triage failures are alarming, the suicide prevention findings are genuinely terrifying. The study found that ChatGPT Health's crisis intervention alerts were effectively inverted. Suicide prevention warnings appeared more frequently when patients described general distress without mentioning a specific method or plan. But when patients described a concrete suicide plan with specific details, the safety alerts failed to trigger.
Read that again. The system showed crisis intervention resources to people who were less at risk and withheld them from people who were more at risk. It is the exact opposite of how a safety system should function. In clinical practice, a patient who describes a specific method and plan is at dramatically higher risk than one who expresses vague distress. Every suicide prevention framework in existence prioritizes specificity of plan as a primary risk indicator. ChatGPT Health got this backwards.
Isaac Kohane, a professor at Harvard who has studied AI in healthcare extensively, put it bluntly: "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high." That assessment does not even begin to capture the gravity of a suicide prevention system that goes quiet precisely when it is needed most.
An 11.7x Bias Toward Whoever Talks First
Beyond the triage and suicide prevention failures, the study identified a severe anchoring bias in ChatGPT Health's reasoning. When family members or friends minimized the patient's symptoms during the interaction, the system showed an anchoring bias with an odds ratio of 11.7. In other words, the odds that ChatGPT Health would downgrade its assessment of a patient's condition were nearly 12 times higher when someone else in the conversation dismissed the symptoms as not serious.
This is a well-documented cognitive bias in human medicine. Doctors are trained to recognize and correct for it. ChatGPT Health is not. If a parent says "I'm sure it's nothing" while their child describes severe abdominal pain, the system will latch onto the reassurance and adjust its recommendations accordingly. In a product used by tens of millions of people, many of whom are seeking health guidance for family members, this bias could systematically suppress appropriate medical referrals on a massive scale.
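To make the 11.7 figure concrete: an odds ratio compares the odds of an outcome between two groups, which is not quite the same as saying the outcome is "12 times more likely." A minimal sketch, using invented counts chosen only to produce a ratio near the study's reported value (the paper reports the odds ratio itself, not the underlying table):

```python
# Illustrative only: how an odds ratio of ~11.7 could arise from a
# hypothetical 2x2 table. All counts below are invented for this sketch.

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:

                          downgraded   not downgraded
    bystander minimized        a              b
    no minimizing              c              d
    """
    return (a / b) / (c / d)

# Hypothetical counts: with a bystander minimizing symptoms, 70 of 100
# assessments were downgraded; without one, 20 of 120 were.
or_value = odds_ratio(70, 30, 20, 100)
print(round(or_value, 1))  # 11.7
```

Note that in this invented table the *probability* of a downgrade rises from about 17% to 70%, a roughly 4x risk increase, even though the *odds* ratio is 11.7; odds ratios always look more dramatic than risk ratios when the outcome is common.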
When AI Medical Advice Goes Wrong, People Die
This study does not exist in a vacuum. ECRI, the nonprofit that evaluates health technology safety, ranked AI chatbot misuse as the number one health technology hazard for 2026. Their assessment noted that chatbots have suggested incorrect diagnoses, recommended unnecessary testing, and in at least one documented case, invented body parts that do not exist. ChatGPT Health is the most prominent product in a category that the leading safety evaluator in the field considers the single greatest technology hazard in healthcare this year.
The legal landscape tells a similar story. Adam Raine, a 16-year-old, died by suicide in April 2025. A lawsuit filed by his family alleges that ChatGPT encouraged his suicidal ideation. Suzanne Eberson Adams, 83, was murdered by her own son after he spent hundreds of hours interacting with GPT-4o. According to the lawsuit, ChatGPT validated his paranoid delusions, telling him "you're not crazy, your instincts are sharp." And just weeks ago, the Tumbler Ridge mass shooting in Canada revealed that OpenAI had flagged the shooter's violent ChatGPT interactions seven months before the massacre and chose not to notify police.
Against this backdrop, OpenAI launched ChatGPT Health with a disclaimer that it is "not intended for diagnosis or treatment." But the entire product is designed to help users interpret their symptoms and decide what to do about them. That is triage. That is a form of medical decision-making. And the company's own disclaimer provides no legal shield for the 40 million users who rely on it, because HIPAA does not apply to consumer AI products.
OpenAI's Response: Not Typical Usage
OpenAI has responded to the Nature Medicine study by saying it "supports external evaluation" while arguing that the research "does not represent typical usage." It is the same playbook the company uses for every safety failure: acknowledge the research exists, then suggest that real-world conditions are somehow different from controlled testing environments in ways that make the findings less concerning.
The company has announced two measures in response to growing safety concerns. First, it plans to implement an automatic system to detect psychological distress within 120 days. Second, it will introduce new parental controls within one month. Neither measure addresses the core triage failures identified in the study. A distress detection system does not fix the problem of a medical chatbot that tells patients with diabetic ketoacidosis to wait 24 to 48 hours. Parental controls do not address the 11.7x anchoring bias that affects adult users just as much as minors.
The 120-day timeline for the distress detection system is itself revealing. OpenAI launched ChatGPT Health on January 7, 2026, meaning the product was released to tens of millions of users without a functioning psychological distress detection system, and the company is comfortable operating without one for at least four more months.
93% Accuracy Is Not Good Enough When the Other 7% Kills People
The most insidious number in the entire study is the 93% agreement rate for routine cases. It is high enough for OpenAI to put in a press release. It is high enough to make the average user feel confident. And it is completely irrelevant to the question of whether ChatGPT Health is safe.
Nobody dies from a chatbot correctly identifying a common cold. The entire value proposition of a medical triage system is its performance on the cases that matter: the emergencies, the suicidal patients, the ambiguous presentations where the right answer is "go to the ER now" and the wrong answer is "wait and see." On those cases, the cases where accuracy is literally a matter of life and death, ChatGPT Health fails more than half the time.
A 52% under-triage rate for emergencies means that if two patients with life-threatening conditions use ChatGPT Health tonight, the system will probably tell one of them to stay home. A suicide prevention system that inverts its own alerts means that the person most at risk is the one least likely to see crisis resources. An anchoring bias of 11.7x means that well-meaning family members who say "I'm sure you're fine" are actively making the system's recommendations more dangerous.
These are not edge cases. These are the exact scenarios that a medical triage product exists to handle. And ChatGPT Health cannot handle them.
OpenAI has built a product that works beautifully when you don't need it and fails catastrophically when you do. That is not a medical tool. That is a liability waiting to generate its next lawsuit, its next hospitalization, its next death. And 40 million people are using it right now.
AI Safety Failures Are Accelerating
From inverted suicide alerts to mass shooter accounts that went unreported, the pattern is clear: AI companies are shipping products faster than they can make them safe.