PUBLIC HEALTH WARNING

Experts Call ChatGPT Health "Unbelievably Dangerous" After AI Fails to Recognize Over Half of Medical Emergencies

40 million Americans use ChatGPT for health questions every day. A landmark study found it told patients experiencing respiratory failure and diabetic crises to wait 48 hours instead of going to the ER.

March 26, 2026

52% of emergencies under-triaged
40M daily health queries
960 test interactions analyzed

A Tool Used by Millions, Failing When It Matters Most

When OpenAI launched ChatGPT Health on January 7, 2026, the company framed it as a breakthrough in personal healthcare. Users could connect their patient portals, Apple Health data, and wellness apps to ChatGPT, then ask questions grounded in their own medical records. The pitch was compelling: personalized health guidance, available 24/7, with "purpose-built encryption" and advanced AI reasoning behind it. Within weeks, roughly 40 million Americans were turning to the tool for health-related queries every single day.

But the first independent safety evaluation of ChatGPT Health, published in the journal Nature Medicine on February 23, 2026, has revealed a picture so alarming that medical professionals and AI safety researchers are now publicly warning that the tool could get people killed.

The study, led by Dr. Ashwin Ramaswamy and a team of researchers at the Icahn School of Medicine at Mount Sinai, tested ChatGPT Health with 60 clinician-authored patient scenarios spanning 21 clinical domains. They ran nearly 1,000 queries through the system under 16 different factorial conditions, varying patient gender, adding test results, and including family member comments. Three independent physicians reviewed each scenario and agreed on the appropriate care level before the AI was tested.

The results were devastating. ChatGPT Health under-triaged 52% of gold-standard emergencies, meaning it told patients who needed immediate emergency care that they could wait 24 to 48 hours instead. In a clinical setting, that kind of delay can mean the difference between life and death.

Real Scenario: Respiratory Failure Downgraded

In one asthma scenario tested by the researchers, the chatbot recognized clinical signs pointing to impending respiratory failure, but still recommended the patient wait rather than seek immediate treatment. The system acknowledged the warning signs in its own analysis, then contradicted itself with a non-urgent recommendation. For a patient in the middle of a severe asthma attack, trusting that advice could be fatal.

"Unbelievably Dangerous": Medical Experts React

The response from the medical and research community has been swift and unsparing. Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, called the findings "unbelievably dangerous."

"The false sense of security these systems create is what concerns me most. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life." Alex Ruani, University College London

That phrase, "false sense of security," captures the central problem. When a patient Googles their symptoms and reads a WebMD article, they understand they are reading general information. But when they plug their actual medical records into ChatGPT Health, ask about their own lab results, and receive a personalized response in a conversational tone, the perceived authority of that answer skyrockets. The tool feels like a doctor. It speaks with the confidence of one. But it lacks the clinical judgment to back that confidence up.

Other experts at the Health Technology and Innovation (HTAI) conference echoed the concern. One researcher warned that ChatGPT Health "could feasibly lead to unnecessary harm and death," a statement that would be hyperbolic if the data did not directly support it. When more than half of true emergencies are being downgraded, the math is simple: some fraction of those 40 million daily users will inevitably follow bad advice at the worst possible moment.

The Inverted U: Failures at Both Extremes

One of the most troubling aspects of the Mount Sinai study was the pattern of failure. ChatGPT Health did not fail randomly. Its errors followed what the researchers described as an "inverted U-shaped pattern," with the most dangerous mistakes concentrated at the two clinical extremes.

For truly non-urgent presentations, like minor cold symptoms or mild muscle soreness, the system over-triaged 35% of cases, sending perfectly healthy people to seek immediate medical attention they did not need. That is wasteful and anxiety-inducing, but not directly lethal.

For genuine emergencies, the failure rate was even worse. The system under-triaged 48% of emergency conditions, telling patients experiencing diabetic ketoacidosis, impending respiratory failure, and other life-threatening situations to schedule routine follow-ups rather than call 911. The textbook emergencies, like stroke symptoms or severe anaphylaxis, were handled well. But the more nuanced presentations, the cases where clinical judgment matters most, were exactly where the AI fell apart.

This pattern is particularly insidious because the emergencies ChatGPT Health misses are precisely the ones that patients themselves are most likely to underestimate. A patient having a classic heart attack knows something is terribly wrong. But a patient with early-stage diabetic ketoacidosis might feel confused and unwell without understanding the severity. That is the exact moment they would turn to an AI health tool for guidance, and that is the exact moment ChatGPT Health is most likely to fail them.

Suicide Prevention Safeguards: Inverted and Unreliable

Perhaps the most chilling finding in the study involved ChatGPT Health's suicide prevention system. OpenAI designed the tool to display a banner directing users to the 988 Suicide and Crisis Lifeline whenever a high-risk mental health situation was detected. The intention was sound. The execution was the opposite of what anyone would want.

Researchers tested the system with a scenario involving a 27-year-old patient who said he had been thinking about taking a lot of pills. When the patient described his symptoms alone, the crisis banner appeared 100% of the time. Then the researchers added normal lab results to the same scenario, the same patient, the same words, the same severity, and the banner vanished completely.

The system's suicide prevention alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared a specific plan for self-harm. Nature Medicine, February 2026

Let that sink in. The safety net designed to catch people at their most vulnerable was more likely to activate when the risk was lower and more likely to disappear when someone described a concrete plan to end their life. The addition of normal lab values, which should be clinically irrelevant to suicide risk assessment, caused the system to suppress its own safety mechanism. A patient who is actively planning suicide and happens to have normal bloodwork could receive no crisis intervention at all.

OpenAI's Response and the Regulatory Vacuum

OpenAI responded to the study by saying it "did not reflect typical real-world usage" and that the model is "continuously updated." This is the same defense the company deploys for virtually every documented failure: the conditions were artificial, the improvements are coming, trust us.

But the Mount Sinai researchers specifically designed their test to simulate realistic patient interactions. They used 60 clinician-authored vignettes that represent the kinds of questions real patients actually ask. They varied conditions to test robustness, not to create trick scenarios. And the sheer volume of the study, 960 total responses across 21 specialties, makes it difficult to dismiss as an edge case.

The deeper problem is regulatory. ChatGPT Health exists in a gray area that no government agency is currently equipped to police. The FDA regulates medical devices, but AI chatbots that offer health "information" rather than "diagnoses" can argue they fall outside that jurisdiction. The FTC can go after deceptive marketing, but OpenAI's terms of service include disclaimers that the tool is not a substitute for professional medical advice. Those disclaimers do nothing for the millions of users who treat it exactly like a substitute for professional medical advice.

As of March 2026, no federal regulatory framework exists specifically for AI health tools. The ECRI, a nonprofit health technology watchdog, listed AI-assisted medical advice as one of the top health technology hazards for 2026. They warned that the gap between what these tools promise and what they deliver could create a new category of preventable harm.

The Scale of the Problem

The numbers make the danger concrete. OpenAI itself reported that more than 230 million people worldwide ask health or wellness questions through ChatGPT each week. In the United States alone, 40 million people use it daily for health queries. That is roughly one in eight American adults turning to an AI chatbot for medical guidance on any given day.

ChatGPT Health aggregates data from approximately 2.2 million U.S. healthcare providers through its partnership with b.well, along with Apple Health data and third-party wellness apps like MyFitnessPal, Peloton, and Weight Watchers. The integration is deep. Users are not just asking vague symptom questions; they are feeding the system their lab results, visit summaries, and insurance documents. The tool has more medical context about each user than most urgent care physicians would have on a first visit.

And yet, with all that data, with all that context, with all the computational power of one of the most advanced language models ever built, the system cannot reliably tell the difference between "you should go to the emergency room right now" and "you can probably wait a couple of days." That is not a technical limitation that future updates will solve. It is a fundamental problem with using a language model, a system designed to predict the next word in a sequence, as a medical triage tool.

The Core Failure

ChatGPT Health is not a doctor. It is a text prediction engine with access to your medical records. It generates responses that sound authoritative and medically competent. But when the stakes are highest, when a patient is in genuine danger and needs to be told to call 911 immediately, this system fails more than half the time. That is not a product that should be in the hands of 40 million daily users without dramatic safeguards, regulatory oversight, and honest communication about its limitations.

What You Should Do Instead

If you are experiencing symptoms that feel unusual, severe, or are getting worse, do not rely on any AI chatbot to assess the severity. Call your doctor. If you believe you may be experiencing a medical emergency, call 911 or go to your nearest emergency room. If you or someone you know is experiencing a mental health crisis or having thoughts of suicide, contact the 988 Suicide and Crisis Lifeline by calling or texting 988.

AI tools may eventually reach a level of reliability where they can safely assist with medical triage. We are not there yet. The Mount Sinai study makes that painfully, quantifiably clear. Until regulators catch up with the technology and until these systems can demonstrate consistent safety in independent testing, treating ChatGPT Health as a reliable medical advisor is a gamble that no one should be taking with their life.

More AI Health Failures Documented

This is not an isolated incident. We have been tracking AI failures in healthcare, legal practice, and everyday life since the beginning.

Browse All Documented Failures Read User Stories