ChatGPT failed to identify 92 percent of fake videos made by OpenAI's own Sora tool, according to research flagged this week by The Decoder. Put differently, when the same company's detector was pointed at the same company's generator, the detector worked fewer than one time in ten. That is not a rounding error. That is a broken premise, held up with duct tape, being sold to regulators and the public as responsible AI.
The finding matters because OpenAI has spent two years telling lawmakers, reporters, and investors that its scale is its safety. Bigger models, the pitch goes, mean richer context, better judgment, and tools capable of identifying the harms generative AI creates. The company that manufactures the deepfake problem is also, supposedly, the company that solves it. A 92 percent miss rate against its own output is a direct rebuttal to that pitch, delivered by the only audience that matters: reality.
The Setup: How Researchers Tested It
The underlying premise of the test is simple. ChatGPT has been marketed as a multimodal assistant capable of evaluating images and, increasingly, video content. OpenAI has also marketed Sora, its text-to-video model, as a generational leap in synthetic media. The researchers, per The Decoder's reporting, asked a reasonable question: can one of those products catch the other?
The answer was no, overwhelmingly. When shown Sora-generated videos and asked to determine whether they were real or AI-generated, ChatGPT classified them as authentic roughly 92 percent of the time. The public reporting does not detail the research team, the full sample size, or the exact prompts used, so we will not speculate on those here. The public summary is the headline number, and the headline number is damning enough without embellishment. A detection system that gets fooled nine out of ten times by its own sibling product is not a detection system. It is a marketing asset.
A 92% Miss Rate Is Not a Glitch, It's a Pattern
For anyone who has been tracking OpenAI's accuracy record through early 2026, this result reads less like a surprise and more like a continuation. In March, Stanford research circulated documenting widespread accuracy degradation in ChatGPT responses. User complaints about hallucinations, fabricated citations, and inconsistent reasoning have been loud enough to move the needle on consumer trust data. The company has shipped a steady drumbeat of new features, while the core reliability metrics that matter to serious users have drifted in the wrong direction.
Against that backdrop, a 92 percent detection failure is not an outlier. It is the visible tip of a broader pattern: OpenAI ships capability first, calibrates safety second, and relies on the sheer pace of releases to keep critics chasing the last headline instead of scrutinizing the current one. The Sora detection result punctures that strategy because it is internally inconsistent in a way even a non-technical reader can see. If ChatGPT cannot tell that Sora made a video, then the safety narrative collapses into a single inconvenient sentence. OpenAI cannot catch OpenAI.
[Chart: OpenAI Claims vs OpenAI Output. ChatGPT performance when asked to identify Sora-generated video as synthetic. Source: research coverage summarized by The Decoder, April 2026.]
Why OpenAI Can't Catch Its Own AI
There are boring, technical reasons why detection models struggle to flag outputs from frontier generative systems, and they are worth stating plainly. First, generators and detectors share training data. The same corpus of video, captioning, and visual features that teaches Sora to produce convincing frames also shapes how ChatGPT reasons about what a real frame looks like. When the two models inherit the same priors, the detector's sense of authenticity drifts in the same direction the generator's output drifts. They are not adversaries. They are siblings agreeing with each other.
Second, detection and generation are caught in an arms race where generation has a structural advantage. Every improvement to Sora raises the floor on what looks real. Every improvement to ChatGPT's multimodal reasoning needs to clear that new floor, while also not producing so many false positives that the tool becomes useless for ordinary users. In practice, detection systems are tuned to be conservative because crying wolf on real videos is a brand problem. That conservatism shows up as permissiveness, and permissiveness shows up as 92 percent miss rates.
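To make that tradeoff concrete, here is a toy sketch in Python. The score distributions, the notion of a single "suspicion score," and every number in it are invented for illustration; nothing here reflects how ChatGPT actually scores video. The point is only the shape of the curve: once real and synthetic clips overlap heavily, the threshold that keeps false alarms rare is also the threshold that waves most fakes through.

```python
import random

random.seed(0)

# Invented "suspicion scores" for a hypothetical detector. Real clips score
# lower on average, synthetic clips higher, but the distributions overlap
# heavily because generator and detector share the same priors.
real_scores = [random.gauss(0.35, 0.15) for _ in range(10_000)]
fake_scores = [random.gauss(0.55, 0.15) for _ in range(10_000)]

def rates(threshold):
    """False-alarm rate on real clips and miss rate on synthetic clips."""
    false_alarms = sum(s >= threshold for s in real_scores) / len(real_scores)
    misses = sum(s < threshold for s in fake_scores) / len(fake_scores)
    return false_alarms, misses

# A conservative threshold rarely accuses a real video of being fake,
# but it lets most synthetic videos pass as authentic.
for threshold in (0.5, 0.7, 0.9):
    fa, miss = rates(threshold)
    print(f"threshold={threshold:.1f}  false alarms={fa:.1%}  misses={miss:.1%}")
```

In this invented setup, pushing false alarms on real video down to roughly one percent pushes the miss rate on synthetic video into the 80-to-90 percent range, which is the general shape of the tradeoff a 92 percent miss rate implies.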
Third, and least charitably, OpenAI has never been commercially incentivized to solve detection. Sora is a revenue product. Sora detection is a compliance bill. The company's public incentives have always pointed toward making generation better, faster, and more accessible. You cannot expect the same company that ships the match to also ship the fire extinguisher and be equally invested in both.
The Policy Stakes
None of this would matter if synthetic video were a niche hobbyist concern. It is not. In the spring of 2026, AI-generated video is actively being used in political advertising, in fraud schemes targeting elderly bank customers, in revenge porn cases that are moving through state courts faster than legislators can draft protections, and in disinformation campaigns now routine enough that newsroom verification desks have had to build entirely new workflows around them. The question of whether a given clip is real or synthetic has left the seminar room and entered the courtroom, the newsroom, and the divorce attorney's office.
Regulators in the United States, the European Union, and Canada have all publicly cited AI detection tooling as a pillar of their enforcement frameworks. The implicit assumption, encouraged by the companies, is that the frontier labs will build credible detection alongside their generators, and that platforms can lean on that detection when making takedown decisions. The 92 percent number blows a hole in that assumption. A platform that relies on ChatGPT to flag Sora content is, statistically, not moderating. It is performing moderation theater.
That matters for election integrity, where a single convincing fake can outrun every correction that follows. It matters for intimate image abuse, where victims need rapid, reliable identification of synthetic content to get it removed. It matters for financial fraud, where AI-generated video calls are already being used to impersonate executives authorizing wire transfers. Each of those use cases has, until now, been told to wait for the detection arms race to mature. The latest round of the race has Sora winning nine to one.
OpenAI's Other Spring 2026 Headaches
The Sora detection story is landing during a stretch that has already been unusually harsh for OpenAI's corporate reputation. In Canada, the government has continued to press the company over its handling of the Tumbler Ridge school shooting case, in which OpenAI flagged and banned an account belonging to a future mass shooter but did not alert any law enforcement agency. The lawsuit from survivor families is active, and federal officials have publicly signaled frustration with the company's response pace.
On the commercial side, Walmart quietly walked away from OpenAI's Checkout integration earlier this spring in favor of its own in-house model, Sparky. That decision, reported by multiple outlets, was read across the retail technology world as a signal that Fortune 50 customers are no longer comfortable building critical infrastructure on top of a provider whose reliability, pricing, and safety posture keep shifting. When the largest retailer in the country concludes it would rather build than buy, the market notices.
Individually, the Tumbler Ridge fallout, the Walmart exit, and the Sora detection failure are three separate stories. Together, they describe a company losing its grip on the two things its valuation has always depended on: trust and technical leadership. OpenAI can survive any one of those crises. The question is whether it can survive them compounding.
What Comes Next
Expect three things in the weeks ahead. First, OpenAI will ship an updated detection pipeline, likely bundled with a Sora release that adds provenance signals, visible watermarks, or cryptographic content credentials. That is the standard industry playbook, and it is not wrong, but it is reactive. Provenance only works if the entire ecosystem, including models not built by OpenAI, agrees to honor it. Sora's competitors have no obligation to cooperate, and many of them already do not.
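For context on how provenance signals work, the sketch below is a minimal, generic illustration and not OpenAI's, Sora's, or the C2PA standard's actual implementation. It uses Python's standard library, with an HMAC over a shared demo key standing in for the public-key signature a real content-credential system would use, and the manifest field names are invented.

```python
import hashlib
import hmac
import json

# Toy stand-in for a key held by the generator's provenance service.
# Real content-credential schemes use public-key signatures instead.
SIGNING_KEY = b"demo-key-not-real"

def issue_manifest(video_bytes: bytes, generator: str) -> dict:
    """Attach a signed provenance manifest to freshly generated video bytes."""
    payload = {
        "generator": generator,  # hypothetical field name, for illustration only
        "sha256": hashlib.sha256(video_bytes).hexdigest(),
    }
    serialized = json.dumps(payload, sort_keys=True).encode()
    payload["signature"] = hmac.new(SIGNING_KEY, serialized, hashlib.sha256).hexdigest()
    return payload

def verify_manifest(video_bytes: bytes, manifest: dict) -> bool:
    """Re-hash the video and check the manifest signature.
    Fails if the bytes were altered or the manifest was forged."""
    claimed = {k: v for k, v in manifest.items() if k != "signature"}
    serialized = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, serialized, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    return claimed["sha256"] == hashlib.sha256(video_bytes).hexdigest()

video = b"...rendered frames would go here..."
manifest = issue_manifest(video, generator="hypothetical-video-model")
print(verify_manifest(video, manifest))          # True: untouched, signed copy
print(verify_manifest(video + b"x", manifest))   # False: re-encoded or edited copy
```

The limitation named above is visible even in this toy version: verification only succeeds on video that was signed at generation time, so a competing model that never issues a manifest produces clips with nothing to check, and stripping the manifest leaves a verifier with an absence rather than a verdict.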
Second, regulators will use this finding. The 92 percent number is exactly the kind of concrete, citable metric that moves a hearing. Expect it to appear in EU AI Act enforcement discussions, in any FTC proceeding that touches synthetic media, and in the next round of state-level deepfake legislation. Numbers like this do not fade. They get entered into the record.
Third, and most importantly, the public narrative around AI safety is shifting from optimistic to forensic. Two years ago, the standard question was what these tools might do if misused. Today, the standard question is what they have already done, and whether the companies building them have any idea. The Sora detection result suggests the answer to the second half of that question is still uncomfortably close to no.
OpenAI will continue to publish safety cards, research notes, and policy blog posts. That is the price of operating as a lightning rod. But the gap between those documents and the empirical performance of the company's actual products keeps widening, and a 92 percent miss rate is not a gap anyone can spin. It is a measurement. Measurements are the part that does not care about narrative.