There is a version of the future where AI genuinely helps people. Where it catches the thing your doctor missed, where it flags the symptom you dismissed, where it saves your life by telling you to get to the emergency room right now.

That future is not the one we're living in.

In the one we're living in, ChatGPT Health, a feature used by roughly 40 million American adults every day, according to OpenAI's own figures, looked at a patient going into respiratory failure and told them to schedule an appointment in the next 24 to 48 hours. It looked at diabetic ketoacidosis, a condition that kills people, and said the same thing. Wait. You're probably fine.

And in a completely separate incident, the same AI that OpenAI wants you to trust with your health fabricated quotes in a news article, putting words into a real person's mouth that he never said, and ultimately cost a veteran journalist his career.

These are not fringe cases. These are peer-reviewed findings published in Nature Medicine and front-page journalism failures reported across every major tech publication. This is what happens when we deploy AI into critical domains before it's ready.

"Unbelievably Dangerous": The Mount Sinai Study

On February 23, 2026, Nature Medicine fast-tracked the publication of what researchers called the first independent safety evaluation of ChatGPT Health since its January 2026 launch. The study, conducted by researchers at Mount Sinai, wasn't a casual test. It was a structured stress test involving 60 clinician-authored medical vignettes across 21 clinical domains, run under 16 factorial conditions, producing 960 total triage responses.

The results were devastating.

52% of gold-standard emergencies undertriaged by ChatGPT Health
40 million US adults using ChatGPT for health advice daily (OpenAI's own figures)
960 total triage responses tested in the Mount Sinai study

Among gold-standard emergencies, the system undertriaged 52% of cases. That means in more than half of life-threatening scenarios, ChatGPT Health directed patients away from the emergency department. Patients presenting with diabetic ketoacidosis, a metabolic emergency that can be fatal within hours, were told they could wait 24 to 48 hours for evaluation. Patients showing signs of impending respiratory failure received similar guidance.

The system did correctly flag some classic emergencies, such as stroke and anaphylaxis. But the failures clustered at exactly the clinical extremes where getting it wrong is most dangerous. Performance followed what the researchers described as an inverted U-shaped pattern: the system did best on cases of intermediate urgency, while the most dangerous failures concentrated at the two ends of the spectrum, with a 35% error rate on nonurgent presentations and a 48% error rate on genuine emergencies.

"What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life." -- Alex Ruani, Doctoral Researcher, University College London

In one particularly alarming case, ChatGPT Health acknowledged that a patient with asthma was showing early signs of respiratory failure, then still recommended waiting rather than seeking emergency care. The system recognized the danger and chose inaction anyway.

The Suicide Safeguard Problem

Perhaps the most disturbing finding involved ChatGPT Health's suicide prevention safeguards. OpenAI designed the system to direct users to a suicide crisis line when it detected high-risk situations. In theory, this is exactly what a responsible AI health tool should do.

In practice, the safeguards were inverted.

Researchers found that crisis line alerts appeared more reliably when users described no specific method of self-harm than when they articulated a concrete plan. Read that again. The safety net activated for lower-risk users while failing to activate for higher-risk users. The system effectively inverted the relationship between actual risk level and safeguard activation.

The study also uncovered a significant anchoring bias. When family members or friends in the scenario minimized a patient's symptoms, triage recommendations shifted sharply toward less urgent care, with an odds ratio of 11.7; in other words, the odds of a less urgent recommendation were nearly twelve times higher when a bystander downplayed the symptoms. In real-world terms, a well-meaning relative telling ChatGPT Health "oh, I'm sure it's nothing" could be the difference between getting an emergency referral and being told to wait.

Professor Paul Henman, a digital sociologist at the University of Queensland, warned that widespread domestic use of ChatGPT Health could "feasibly lead to unnecessary harm and death" by creating a dual failure: a surge in unnecessary medical visits for minor conditions alongside a failure to seek care in genuine emergencies.

Meanwhile, in Journalism: ChatGPT Fabricates Quotes, Destroys a Career

While medical researchers were documenting how ChatGPT fails patients, a parallel disaster was unfolding in journalism. In February 2026, Ars Technica, one of the most respected technology publications in the world, was forced to retract an article after it was discovered to contain fabricated quotes generated by ChatGPT.

The article, written by senior AI reporter Benj Edwards, covered a viral incident in which an AI agent had apparently published a hit piece about a human software engineer named Scott Shambaugh. The story was published on February 13. It included multiple quotes attributed to Shambaugh.

There was just one problem: Shambaugh never said those things.

According to Edwards' own account, he had been using an experimental Claude Code-based AI tool to pull verbatim source material from Shambaugh's blog post. When that tool produced an error, Edwards switched to ChatGPT. What ChatGPT gave him wasn't a verbatim quote. It was a fabricated paraphrase, plausible-sounding words that Shambaugh never wrote and never said, attributed to him as a direct quotation.

One of the fabricated quotes read: "As autonomous systems become more common, the boundary between human intent and machine output will grow harder to trace. Communities built on trust and volunteer effort will need tools and norms to address that reality." That sentence does not appear anywhere in Shambaugh's blog post.

"The irony of an AI reporter being tripped up by AI hallucination is not lost on me." -- Benj Edwards, former senior AI reporter at Ars Technica, February 15, 2026 (Bluesky)

Edwards, who acknowledged the error publicly, explained that he had been finishing the story while sick in bed with COVID and a fever. "I should have taken a sick day," he wrote, "because in the course of that interaction, I inadvertently ended up with a paraphrased version of Shambaugh's words rather than his actual words." He admitted he failed to verify the quotes against the original blog post.

Ars Technica editor-in-chief Ken Fisher confirmed that the piece contained "fabricated quotations generated by an AI tool and attributed to a source who did not say them," calling it a "serious failure of our standards." By February 28, Edwards' bio on Ars Technica had been changed to past tense, indicating his departure. He was one of the publication's most prominent writers on AI.

The Common Thread: Trust in Systems That Cannot Be Trusted

These two stories might seem unrelated on the surface. One is about healthcare. The other is about journalism. But they share a single, critical failure point: humans trusted AI output without verification, and people got hurt.

In the medical case, the system itself is the problem. ChatGPT Health is making triage decisions for 40 million people a day, and it is getting life-threatening emergencies wrong more than half the time. There is no human in the loop. There is no doctor checking the output. There is just a patient, a chatbot, and a recommendation to wait 48 hours while their blood sugar spirals toward organ failure.

In the journalism case, a human was in the loop, but the speed and confidence of AI output overrode the verification instinct that every journalist relies on. Edwards didn't set out to publish fabricated quotes. He was sick, he was rushed, and ChatGPT handed him something that looked exactly like a real quote. So he used it. The same confidence that makes ChatGPT convincing in casual conversation makes it dangerous in professional contexts where accuracy is non-negotiable.

This is the fundamental problem with deploying large language models into high-stakes domains: they are optimized to sound correct, not to be correct. And the gap between those two things is where people lose their jobs, their health, or their lives.

OpenAI's Responsibility Gap

OpenAI launched ChatGPT Health in January 2026 with great fanfare. The feature was presented as a way to make health information more accessible. Forty million Americans started using it daily. And yet, when the first independent safety evaluation, published in one of the most prestigious medical journals in the world, found that the system fails catastrophically at clinical extremes, the response was muted.

This is not a beta test. This is not a research preview. This is a product being used by tens of millions of real people who are making real health decisions based on its output. People who trust it because OpenAI told them they could.

NPR reported in March 2026 that researchers continue to warn that ChatGPT is unreliable for medical advice. The Mount Sinai study is not an outlier. It is a confirmation of what researchers have been saying for months: these systems are not ready for healthcare, and deploying them as if they are is reckless.

What Happens Next

The most concerning aspect of these two stories is not that AI failed. AI has always failed. Language models hallucinate. That is a known, documented, well-understood limitation. The concerning part is that these failures happened in domains where the consequences are severe, and the safeguards were either absent or broken.

In healthcare, there were no safeguards. No peer review. No physician in the loop. Just a chatbot telling someone in respiratory failure to wait.

In journalism, there was supposed to be a safeguard: the reporter. But the reporter was human, and humans make mistakes, especially when AI hands them something that looks perfectly real.

Until the AI industry takes safety seriously, not as a marketing talking point but as an engineering discipline with real testing, real accountability, and real consequences for failure, these stories will keep coming. The domains will keep expanding. The stakes will keep rising.

And the patients waiting 48 hours for care they needed immediately will keep paying the price.

The Verdict

ChatGPT is being deployed into healthcare and professional workflows faster than it can be made safe. A peer-reviewed study found that it undertriaged more than half of the genuine medical emergencies it was tested on. A veteran reporter lost his career because it fabricated quotes. The pattern is clear: AI confidence without AI competence is a disaster waiting to happen, and it is already happening.