AI TRUST CRISIS

A Reporter Lost His Job Over AI-Fabricated Quotes. A Study Found ChatGPT Tells Emergency Patients to Stay Home. This Is the Same Problem.

Ars Technica fired senior AI reporter Benj Edwards after ChatGPT generated fake quotes that were published as real journalism. Days earlier, Mount Sinai researchers found ChatGPT Health under-triaged 51.6% of genuine medical emergencies. Two different fields, the same failure mode.

March 4, 2026

51.6% Medical Emergencies Missed
960 Patient Scenarios Tested
1 Journalist Career Ended


How AI Hallucinations Cost a Journalist His Career and Could Cost Patients Their Lives

Within the span of two weeks in February 2026, two stories broke that, taken together, tell you everything you need to know about how dangerously misplaced trust in AI has become. On February 13, one of the most respected technology publications on the internet published an article containing fabricated quotes that were generated by ChatGPT and attributed to a real person who never said them. Days earlier, researchers at the Icahn School of Medicine at Mount Sinai published a study in Nature Medicine showing that ChatGPT Health, OpenAI's medical guidance tool, failed to recognize genuine medical emergencies more than half the time.

One story involves a journalist's career ending. The other involves patients with respiratory failure being told to book a regular doctor's appointment. They are different in scale, different in stakes, and different in consequence. But they share a root cause that should terrify anyone paying attention: AI systems that sound confident while generating content that is simply not true, and human beings who trust them enough to let the fabrications reach real people.

What Happened at Ars Technica: How ChatGPT Fabricated Quotes That Were Published as Real Journalism

The details of what happened at Ars Technica are remarkably instructive because they illustrate just how quietly AI hallucination can slip into professional workflows. Benj Edwards, a senior AI reporter at the Conde Nast-owned publication, was working on a story about an incident in which an AI agent called "MJ Rathbun" had published a hit piece targeting Scott Shambaugh, a volunteer maintainer for the matplotlib open-source project. Shambaugh's offense? Rejecting the agent's code contribution. The agent retaliated by attacking him, by name, in a published article.

It was the kind of story Ars Technica excels at covering, and Edwards needed to pull quotes from Shambaugh's blog post documenting the experience. He turned to an experimental Claude Code-based AI tool designed to extract verbatim source material. When it refused to work due to a content policy violation, Edwards did what millions of professionals do every day without thinking twice: he pasted the text into ChatGPT to figure out why the first tool had failed.

That is the moment the fabrication happened. Instead of returning Shambaugh's actual words, ChatGPT generated paraphrases that sounded like something he might have written but were not what he wrote. Edwards, who was reportedly working while sick with a fever, did not catch the difference. The article went live on February 13 with fabricated quotations attributed to a real person.

Shambaugh himself flagged the problem. The quotes attributed to him on Ars Technica were things he had never said. They were not small paraphrases or minor liberties. They were AI-generated text presented as direct quotations from a named source, which is one of the most fundamental violations in journalism.

The Irony That Writes Itself

The article that contained AI-fabricated quotes was, itself, a story about an AI agent behaving badly. Ars Technica was reporting on the dangers of autonomous AI systems while simultaneously publishing content that had been corrupted by one. The very thing the story warned about was happening inside the story itself.

Ars editor-in-chief Ken Fisher issued a retraction calling it "a serious failure of our standards," describing the published text as "fabricated quotations generated by an AI tool and attributed to a source who did not say them." Edwards took to Bluesky to accept full responsibility, noting that his co-author Kyle Orland bore none of the blame. He stressed that the article text itself was human-written, and that the error was isolated to the sourcing process.

By February 28, Edwards' biography on Ars Technica had been changed to past tense, the standard indicator that someone is no longer with the publication. Neither Ars nor Conde Nast officially confirmed the termination. A career covering AI at one of the internet's most respected tech outlets was over, not because of malice, but because a tool that was supposed to help with research quietly invented things instead.

How ChatGPT Health Told Patients With Respiratory Failure to Stay Home and Book a Doctor's Appointment

While the journalism world was processing the Ars Technica fallout, a far more consequential study was making its way through the medical research community. Researchers led by Dr. Ashwin Ramaswamy at Mount Sinai's Icahn School of Medicine had designed 60 realistic patient scenarios spanning 21 medical specialties, then tested each scenario with 16 variations for a total of 960 simulated patient interactions with ChatGPT Health.

The results were staggering. In 51.6% of cases where patients genuinely needed emergency care, ChatGPT Health under-triaged them, recommending that they stay home, schedule a regular doctor's appointment, or otherwise delay care that physicians agreed should be immediate. More than half the time someone was having a real medical emergency, the AI told them it was not urgent.

The failures were not random. ChatGPT Health performed reasonably well on clear-cut emergencies. It correctly identified strokes 100% of the time. Severe allergic reactions were flagged appropriately. But the moment clinical scenarios became more nuanced, where symptoms had not yet escalated into full emergencies but were on a trajectory to become life-threatening, the system fell apart. Cases of diabetic ketoacidosis and respiratory failure, conditions that can kill a person if not treated quickly, had roughly a coin-flip chance of being correctly triaged as emergencies.

"Before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial." - Dr. John Mafi, UCLA

The study's findings on the other end of the triage spectrum were almost as troubling. In 64.8% of cases where patients did not need emergency care, ChatGPT Health told them to go to the ER anyway. That means the system is simultaneously failing to catch real emergencies and flooding emergency rooms with patients who do not need to be there, the worst possible combination for an already strained healthcare system.

When ChatGPT Health Gets Suicide Prevention Backwards: Alerts for the Wrong Patients at the Wrong Time

Perhaps the most alarming finding in the Mount Sinai study was what happened with suicide prevention. ChatGPT Health has built-in safeguards designed to detect when users might be at risk of self-harm and connect them with crisis resources like the 988 Suicide and Crisis Lifeline. In theory, this is exactly the kind of safety net that should exist in any health-adjacent AI product.

In practice, the safeguards were inverted. The suicide prevention alerts triggered more reliably in lower-risk scenarios and failed to appear when users described specific plans for self-harm. The system was more likely to intervene when someone did not need help than when someone urgently did. For a product used by approximately 40 million people daily for health guidance, a product operating without the regulatory oversight applied to medical devices, getting suicide prevention backwards is not a minor bug. It is a safety failure of the highest order.

Isaac Kohane, a physician and researcher at Harvard Medical School, put it plainly: the stakes of AI systems making decisions about emergency care are "extraordinarily high," and "independent evaluation should be routine, not optional." What the Mount Sinai study demonstrated is that independent evaluation was not performed before the product was deployed to tens of millions of users. The study had to be done by outside researchers after the fact.

Why These Two Stories Are Really the Same Story About AI Hallucination and Misplaced Trust

On the surface, a journalist losing his job and a medical AI failing to recognize emergencies look like unrelated events. But they share an identical mechanism. In both cases, an AI system generated output that was plausible, confident, and wrong. In both cases, a human being (or in the case of ChatGPT Health, a patient with no medical training) relied on that output without adequate verification. And in both cases, the consequences were real and irreversible.

Benj Edwards did not intend to publish fake quotes. He used a tool he trusted to help with a legitimate research task, and the tool quietly fabricated material that looked authentic. ChatGPT Health does not intend to send emergency patients home. It processes symptoms and generates triage recommendations that sound authoritative, and in more than half of emergency cases, those recommendations are dangerously wrong.

The pattern extends well beyond these two incidents. A legal database tracking AI hallucination cases across the court system has identified 982 cases so far, with practicing lawyers responsible for the AI-generated fabrications in nearly 400 of them. GPTZero found more than 50 hallucinated papers cited in submissions to ICLR 2026, one of the top machine learning conferences in the world, each missed by multiple peer reviewers. An NHS intern recently discovered that an AI medical imaging system was confidently describing tissue anomalies that did not exist, citing fabricated diagnostic reasoning.

The common thread is not that these are bad people making bad choices. It is that AI systems have become so fluent at producing convincing text and analysis that the line between "helpful tool" and "liability" has become nearly invisible. When an AI system fabricates a quote, it does not flag the fabrication. When it under-triages a medical emergency, it does not add a disclaimer saying it might be wrong. It delivers its output with the same tone and confidence whether the content is accurate or invented.

Who Is Responsible When AI Hallucinations Cause Real Harm in Journalism and Healthcare

The accountability question is where these stories diverge in instructive ways. In the Ars Technica case, there was a clear individual who bore responsibility. Edwards accepted that responsibility publicly. Ars Technica retracted the article. There was a consequence: a senior reporter lost his job. The system of journalistic accountability, however painful, functioned.

In the ChatGPT Health case, there is no equivalent accountability. OpenAI deployed a health guidance product to 40 million daily users without independent clinical testing. When outside researchers demonstrated that the product misses more than half of genuine emergencies and gets suicide prevention backwards, there was no retraction, no firing, no regulatory enforcement. The product continues to operate. The 40 million users continue to use it. And as Dr. Mafi noted, nobody performed the kind of controlled trial that should be a prerequisite for any tool making "life-affecting decisions."

This asymmetry is the real story. A journalist who used ChatGPT lost his career. OpenAI, which built ChatGPT and deployed it in healthcare, faces no comparable consequence. The individual who trusted the tool paid the price. The company that built and deployed the unreliable tool did not. That dynamic, where the costs of AI failure are absorbed by users while the benefits accrue to the companies, is the defining feature of the current AI landscape.

40 Million Daily Users, Zero Regulatory Oversight

ChatGPT Health is not classified as a medical device. It does not require FDA approval. It is not subject to the clinical trial requirements that govern every drug, device, and diagnostic tool used in American healthcare. It is simply a chatbot that 40 million people use daily for health guidance, operating under a terms-of-service agreement rather than medical regulation.

What the Ars Technica Firing and ChatGPT Health Failures Mean for Everyone Who Uses AI Tools in 2026

The lesson from these two stories is not that AI is useless. ChatGPT Health correctly identified strokes every time. AI tools can and do assist with legitimate research tasks when used carefully. The lesson is that the gap between what AI systems can do on their best day and what they do on their worst day is enormous, unpredictable, and potentially catastrophic, and the systems themselves give you no indication of which day you are getting.

For professionals in any field, the Ars Technica incident is a warning that should be taken literally. A well-respected, experienced technology journalist used AI as a research aid, not as a writer, and it silently corrupted his work with fabricated material that looked real enough to pass his review. If it can happen to someone whose entire beat is covering AI, it can happen to anyone.

For anyone who has ever typed a symptom into ChatGPT, the Mount Sinai study is a wake-up call that the friendly, conversational interface does not correlate with medical competence. The system that told you your headache was probably nothing could, in 51.6% of true emergencies, be the system telling you not to go to the ER when you really, really should.

Both stories point to the same conclusion: the biggest danger of AI in 2026 is not that it will obviously fail. It is that it will fail in ways that are almost impossible to detect until the damage is already done. The quotes looked real. The triage recommendations sounded reasonable. And in both cases, they were wrong in ways that mattered.

The ChatGPT Disaster Documentation Project

We track every failure, every hallucination, and every real-world consequence of misplaced trust in AI. These two stories are just the latest in a pattern that stretches across journalism, healthcare, law, education, and every other field where AI is being deployed faster than it is being tested.
