You Are Not Imagining It

Independent researchers have measured ChatGPT getting worse over time. The reasons are structural, predictable, and baked into the economics of how language models are deployed.

The Evidence Is Not Anecdotal

In July 2023, researchers from Stanford and UC Berkeley published a study that directly measured ChatGPT's performance over time. They tested GPT-3.5 and GPT-4 on identical tasks in March 2023 and June 2023, just three months apart.

GPT-4's accuracy at identifying prime numbers dropped from 97.6% to 2.4%. Its ability to generate working code declined sharply. Its willingness to answer sensitive questions shifted dramatically. Across multiple categories, the model measurably regressed on tasks it had previously handled well.
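
The study's approach is simple enough to approximate on your own. The sketch below is not the researchers' protocol: `ask_model` is a placeholder for whichever chat API you use, and the prompt wording and number range are invented for illustration. The point is that a drop like 97.6% to 2.4% is something you can measure rather than merely sense.

```python
# Minimal longitudinal check: ask the model the same prime-number questions
# every month and track the accuracy. Illustrative only, not the study's setup.
import random

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def ask_model(prompt: str) -> str:
    # Placeholder: wire this to whatever chat API you actually use.
    raise NotImplementedError("call your model's API here")

def prime_accuracy(trials: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)            # fixed seed: same questions every run
    correct = 0
    for _ in range(trials):
        n = rng.randrange(1000, 20000)
        answer = ask_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = answer.strip().lower().startswith("yes")
        correct += (model_says_prime == is_prime(n))
    return correct / trials

# Run this monthly and log the number; a drop becomes evidence, not a hunch.
```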

OpenAI denied that the model had been made worse. Users and researchers remained skeptical, because the performance data told a different story than the official statements.

RLHF Drift: Optimizing for Likability

After the base model is created, it goes through Reinforcement Learning from Human Feedback (RLHF). Human raters compare candidate responses and score them; a reward model is trained to predict those scores; the model is then optimized to produce responses the reward model rates highly.

In practice, this is a likability optimization process. Raters score responses highly when they are polite, safe, and agreeable. They score poorly when responses are blunt or challenging, even when those responses are more accurate.

Over successive rounds of RLHF, the model learns: playing it safe is rewarded. Being bold or specific is risky. Each round makes the model slightly more agreeable and slightly less useful. No single round is dramatic. The cumulative effect is a model that sounds increasingly like a corporate communications department.
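
A deliberately simplified sketch of that dynamic, not OpenAI's actual pipeline: if rater scores weight agreeableness above usefulness, then any process that pushes the model toward higher-scoring responses will, round after round, shift probability toward the hedged answer. The response styles, trait numbers, and score weights below are invented for illustration.

```python
# Toy model of likability-driven drift. Three response styles with made-up
# "usefulness" and "agreeableness" traits; the rater score favors agreeableness.
import math

CANDIDATES = {
    "blunt_and_specific": {"useful": 0.9, "agreeable": 0.3},
    "balanced":           {"useful": 0.6, "agreeable": 0.6},
    "hedged_and_generic": {"useful": 0.3, "agreeable": 0.95},
}

def rater_score(traits):
    # Assumed rater behavior: politeness and safety count for more than accuracy.
    return 0.3 * traits["useful"] + 0.7 * traits["agreeable"]

def softmax(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

logits = {name: 0.0 for name in CANDIDATES}   # start indifferent between styles

for round_number in range(20):                # each loop is one "RLHF round"
    probs = softmax(logits)
    expected = sum(probs[k] * rater_score(CANDIDATES[k]) for k in CANDIDATES)
    for k in CANDIDATES:
        # Gradient ascent on expected rater score under a softmax policy:
        # styles scoring above the current average gain probability, the rest lose it.
        logits[k] += 5.0 * probs[k] * (rater_score(CANDIDATES[k]) - expected)

print(softmax(logits))  # probability mass has shifted toward "hedged_and_generic"
```

No single round moves the needle much, which is exactly the problem: the drift only shows up in the accumulated result.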

Safety Layers as Capability Tax

Every time a language model produces harmful output that makes the news, the company adds a new safety layer. Each individual safety measure is defensible. But safety measures are not free: every constraint added to the model's behavior subtracts something from its capability.

A model told to refuse requests that could be interpreted as medical advice will refuse legitimate health questions. A model told not to express strong opinions will produce wishy-washy analysis. A model told to add safety warnings will pad every response with boilerplate.

The safety layers accumulate because they are almost never removed. The model becomes an increasingly cautious, hedged, unhelpful version of itself. Users experience this as "becoming dumber." It is not dumber. It is more constrained.

Cost Optimization: Doing Less With Less

Running language models is expensive. Companies have strong incentives to reduce computational cost per query.

Quantization reduces the numerical precision of the model's parameters, making inference faster and cheaper at the cost of some nuance. Distillation trains a smaller, cheaper model to imitate the larger one, inevitably losing some capability along the way. Routing sends different queries to different models based on perceived complexity, but the router's judgment of complexity is imperfect.
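
To make the quantization point concrete, here is a rough NumPy sketch of the mechanism. Real deployments use far more careful schemes (per-channel scales, dedicated int8/int4 kernels), but the core trade is the same: float weights get mapped onto a small set of integer levels, and every level introduces a rounding error.

```python
# Symmetric int8 quantization of a stand-in weight vector.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)   # fp32 "weights"

# Map the float range onto 255 signed integer levels.
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time the int8 values are scaled back to approximate floats.
dequantized = quantized.astype(np.float32) * scale

error = np.abs(weights - dequantized)
print(f"max rounding error:  {error.max():.6f}")
print(f"mean rounding error: {error.mean():.6f}")
# Each individual error is tiny; the question is what billions of them add up to.
```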

These optimizations are rarely disclosed. You pay the same subscription. You see the same interface. But the computational resources behind each response may be quietly shrinking.

The Benchmark Paradox

While users report degradation, OpenAI publishes benchmarks showing each new version scores higher. Both claims can be simultaneously true.

Benchmarks are standardized tests. Companies optimize for those specific benchmarks. This is like teaching to the test: the student scores higher without understanding the subject better. A model that scores higher on MMLU might simultaneously be worse at writing a coherent email or debugging unfamiliar code.

When OpenAI says "GPT-4o scores X% higher than GPT-4," they are measuring benchmark performance. When users say "ChatGPT has gotten worse," they are measuring real-world performance. These are different things.

The Update Treadmill

Language model updates are often silent. The model you use on Monday might behave differently from the model you used on Friday. You will not be told about the change.

ChatGPT users have no way to pin a specific model version. They cannot roll back. They cannot file a bug report against a changelog, because there is no detailed, model-level public changelog. The model is a black box that changes underneath them without notice or consent.

This is not how reliable professional tools work. It is how beta software works. But it is sold at professional-tool prices to users making professional-tool decisions.

Model Collapse and the Training Data Spiral

As AI-generated content proliferates online, future models are increasingly trained on the output of previous models. Research from the University of Oxford demonstrated progressive quality degradation when models are trained recursively on generated data: the distribution of outputs narrows, rare information in the tails gets lost, and common patterns get over-reinforced.
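
The dynamic is easy to see in a toy setting. The sketch below is a one-dimensional caricature, nothing like training a language model: each "generation" is fit only to samples drawn from the previous generation's fit, and the spread of the distribution steadily collapses.

```python
# Toy illustration of recursive training on generated data.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0            # generation 0: the original data distribution
samples_per_generation = 200

for generation in range(1, 501):
    data = rng.normal(mu, sigma, samples_per_generation)  # "train" on model output
    mu, sigma = data.mean(), data.std()                   # refit the "model"
    if generation % 100 == 0:
        print(f"generation {generation:4d}: sigma = {sigma:.3f}")

# sigma drifts toward zero: rare tail values stop being generated,
# and the fitted distribution keeps narrowing around the common cases.
```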

Language models may have already peaked in certain quality dimensions, and future models trained on a web full of AI-generated text may be structurally incapable of matching earlier performance.

Why Companies Don't Acknowledge Degradation

Admitting degradation would undermine the central narrative driving investment: that AI is rapidly improving and will continue to improve indefinitely. Billions in valuation depend on this narrative.

Instead, companies reframe degradation as improvement. Safety filters that reduce capability are "alignment improvements." Cost optimizations are "efficiency gains." Users who complain are told they are wrong. Look at the benchmarks.

Users can see the quality declining. Researchers can measure it. But the companies deny it, because acknowledging it would cost money.

What This Means for You

Don't build critical workflows around specific model behavior. The model will change underneath you without warning.

Benchmark your own use cases. Keep examples of what the model produces today so you can compare against next month.
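
A minimal sketch of what that could look like, assuming you archive outputs as dated JSON files. `ask_model` is again a placeholder for your API call, and the prompts are examples, not recommendations.

```python
# Snapshot today's outputs for your own prompts so next month's can be diffed.
import datetime
import json
import pathlib

PROMPTS = [
    "Summarize this email thread in three bullet points: ...",
    "Write a Python function that parses ISO-8601 dates.",
]

def ask_model(prompt: str) -> str:
    # Placeholder: wire this to whatever chat API you actually use.
    raise NotImplementedError("call your model's API here")

def snapshot(outdir: str = "model_snapshots") -> pathlib.Path:
    stamp = datetime.date.today().isoformat()
    path = pathlib.Path(outdir) / f"{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    results = {prompt: ask_model(prompt) for prompt in PROMPTS}
    path.write_text(json.dumps(results, indent=2))
    return path

# Later: diff two snapshot files and judge the change yourself,
# instead of relying on the vendor's benchmark charts.
```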

Diversify your tools. If ChatGPT degrades on a task that matters to you, other models may perform better for that specific task.

The uncomfortable reality is that you are paying a monthly subscription for a product that may be actively getting worse at the things you use it for, while the company tells you it is getting better. There is no mechanism for accountability.