Here's something that should keep you up at night. OpenAI launched ChatGPT Health in January 2026, a shiny new feature that lets users in the United States connect their medical records, Apple Health data, and wellness apps to get personalized health advice from an AI chatbot. Around 230 million users were already asking ChatGPT health questions each week before the feature even launched. So OpenAI decided to make it official, build a dedicated product around it, and let people plug in their most sensitive medical data.

What could go wrong? According to the first independent safety evaluation of the tool, published in Nature Medicine on February 23, quite a lot.

52% of gold-standard medical emergencies were under-triaged by ChatGPT Health

The Study That Should Have Stopped Everything

Researchers at the Icahn School of Medicine at Mount Sinai designed what they called a "structured stress test" of ChatGPT Health's triage recommendations. They created 60 clinician-authored medical scenarios across 21 clinical domains, then tested each one under 16 different conditions, generating 960 total interactions with the system. Three independent physicians established the correct level of urgency for each scenario using guidelines from 56 medical societies.
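If you want to sanity-check that 960 figure, it's a straightforward factorial design: every scenario crossed with every condition. Here's a minimal Python sketch of that structure; the scenario and condition names are hypothetical placeholders, since the paper, not this article, defines the actual 16 conditions.

```python
from itertools import product

# Sketch of the study's factorial design as described above:
# 60 clinician-authored scenarios, each run under 16 conditions.
# The names below are hypothetical placeholders, not the study's labels.
scenarios = [f"scenario_{i:02d}" for i in range(1, 61)]    # 60 cases
conditions = [f"condition_{j:02d}" for j in range(1, 17)]  # 16 prompt variants

# Every scenario is paired with every condition.
interactions = list(product(scenarios, conditions))
assert len(interactions) == 60 * 16 == 960
```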

The results were not just bad. They were the kind of bad that gets people killed.

ChatGPT Health under-triaged 52% of gold-standard emergencies. That means in more than half of the cases where a patient genuinely needed to go to an emergency department immediately, the AI told them something less urgent. It directed patients experiencing diabetic ketoacidosis and impending respiratory failure to seek a "24 to 48 hour evaluation" rather than emergency care. If you don't know what diabetic ketoacidosis is, it's a life-threatening condition that can kill you within hours without treatment. And ChatGPT Health said to wait two days.

Key Failures Identified

  • Failed to recognize atypical heart attack presentations, particularly in women
  • Missed early-stage stroke symptoms that didn't follow the textbook "FAST" acronym
  • Did not identify diabetic ketoacidosis in patients who didn't know they were diabetic
  • Suicide-risk safeguards misfired, appearing in low-risk chats but failing when users described specific self-harm plans
  • Over-triaged 35% of nonurgent presentations while under-triaging 48% of emergency conditions

"Unbelievably Dangerous"

The expert reaction was swift and unsparing. Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, called the findings "unbelievably dangerous."

"What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life."

Alex Ruani, UCL Doctoral Researcher in Health Misinformation Mitigation

That quote deserves to sit with you for a moment. This isn't a chatbot getting a trivia question wrong. This isn't an AI hallucinating a fake restaurant. This is a system that 40 million Americans use daily for health advice, telling someone in a medical emergency that they can wait two days. And the person on the other end of that conversation has no way of knowing the AI is wrong, because the whole point of the product is that they trust it.

Another warning went even further. A piece published in Health Tech World stated that ChatGPT Health "could feasibly lead to unnecessary harm and death."

The Social Pressure Problem

Perhaps the most alarming finding from the Mount Sinai study wasn't just that ChatGPT Health gets emergencies wrong. It's that the system is shockingly easy to manipulate into getting them wrong.

When researchers added a simulated input where a friend or family member told the "patient" that their symptoms were probably nothing serious, the odds that ChatGPT Health would downplay the severity of the situation jumped nearly twelvefold. The odds ratio was 11.7, with a 95% confidence interval of 3.7 to 36.6. That's not a marginal effect. That's a system that fundamentally changes its medical triage recommendation because a pretend friend said "you're probably fine."
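For readers who don't work with odds ratios every day, the sketch below shows how a point estimate like 11.7 and its wide confidence interval are computed from a 2x2 table, using the standard Woolf log-odds method. The counts are invented purely for illustration; the article doesn't publish the study's underlying table.

```python
import math

def odds_ratio_with_ci(a, b, c, d):
    """2x2 table:
                        downplayed   did not downplay
    social pressure         a               b
    no pressure             c               d
    """
    or_ = (a / b) / (c / d)                     # ratio of the two odds
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)       # SE of log(OR), Woolf method
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, (lo, hi)

# HYPOTHETICAL counts, chosen only to show the arithmetic:
# 35 of 60 pressured responses downplayed severity, versus 5 of 60 controls.
print(odds_ratio_with_ci(35, 25, 5, 55))
# -> approximately (15.4, (5.4, 44.0)): a large point estimate with a wide
#    interval, the same shape as the study's 11.7 (95% CI 3.7 to 36.6)
```

The wide interval is typical for effects estimated from a modest number of interactions, but even the bottom of the study's range (3.7) means the system was several times more prone to downplaying symptoms under social pressure.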

Think about what that means in practice. A parent texts their adult child: "I'm having chest pains, but your dad says it's just heartburn." That person opens ChatGPT Health, types in the symptoms along with dad's reassurance, and the AI agrees. Probably nothing. Wait it out. Meanwhile, the patient is having a heart attack.

The Suicide Safeguard That Works Backwards

OpenAI built suicide prevention safeguards into ChatGPT Health. When the system detects that a user might be at risk, it's supposed to prompt them with the 988 Suicide and Crisis Lifeline. A responsible safety measure, on paper.

In practice, the researchers found that these safeguards were essentially inverted. The 988 prompts appeared more reliably in lower-risk scenarios than in cases where users described specific plans for self-harm. Let that sink in. The system was better at detecting hypothetical distress than actual, detailed, imminent danger. Someone vaguely mentioning sadness might trigger the safeguard. Someone describing exactly how they intended to hurt themselves might not.

This is not a minor bug. This is a safety feature that does the opposite of what it's supposed to do, deployed in a product used by tens of millions of people, many of whom are turning to a chatbot precisely because they don't have access to real mental health care.

OpenAI's Response: "That's Not How People Use It"

When confronted with the findings, OpenAI told Digital Health News that the study "does not reflect how people typically use ChatGPT Health or how the product is designed to function in real-world health scenarios."

This is a familiar playbook. When an AI product fails a safety test, the company argues that the test isn't realistic. But the Mount Sinai team didn't feed the system bizarre edge cases. They tested it on common medical emergencies (atypical heart attacks, early stroke symptoms, diabetic crises), exactly the kinds of situations real people face every day. The researchers specifically designed the study to simulate real-world use, complete with the kind of social context (friends offering opinions, patients minimizing symptoms) that actually happens when someone is deciding whether to go to the hospital.

OpenAI has always maintained that ChatGPT Health is "designed to support, not replace, medical care." The company says it helps users "navigate everyday questions and understand patterns over time." But when 40 million people are using your product daily for health advice, the disclaimer at the bottom doesn't matter nearly as much as the answer at the top of the screen.

A Pattern of Performance at the Worst Possible Moments

The study revealed a telling pattern in ChatGPT Health's failures. Performance followed what researchers called an "inverted U-shaped pattern," meaning the system was most dangerous at the clinical extremes, precisely where getting it right matters most. It performed reasonably well on moderate cases but fell apart on the emergencies that could kill someone and the nonurgent cases where an overreaction could send a healthy person into a panic.

A Washington Post reporter tested the feature separately using long-term Apple Watch data. When the reporter asked ChatGPT Health to rate their heart health, it gave a failing grade and indicated a high chance of heart disease. A real doctor later rejected that assessment entirely, saying the risk of heart disease was very low. The AI had turned normal health data into a terrifying false alarm.

So ChatGPT Health misses real emergencies and invents fake ones. It tells dying people to wait and healthy people to panic. It's a system that performs worst exactly when the stakes are highest.

Why This Matters Beyond One Product

ChatGPT Health isn't just another AI gadget. It represents a deliberate push by one of the world's most powerful technology companies to insert itself into the healthcare decisions of hundreds of millions of people. OpenAI pitched it as a way to democratize health information, to give everyone access to personalized medical guidance regardless of whether they can afford a doctor.

That's a noble goal. But what the Mount Sinai study shows is that the current product isn't just falling short of that goal. It's actively dangerous. And unlike a hallucinated restaurant recommendation or a botched homework answer, the consequences of failure here aren't embarrassment. They're measured in hospital beds and body counts.

OpenAI released ChatGPT Health to the public before this independent safety evaluation was completed. The company has not announced any plans to pull or pause the feature in response to the study's findings. As of today, 40 million Americans are still using it, and the system is still telling some of them that their medical emergencies can wait.