Not All Limits Are Temporary
Some things language models do poorly today but will do better tomorrow. Other things they cannot do at all, because the architecture was never designed for them. Knowing the difference changes how you use the tool.
The Difference Between Hard and Soft Limits
Language model companies talk about limitations as temporary engineering challenges. More data, more compute, more clever training will fix them. Some limitations are like that. Others are not.
Soft limits are things the model does poorly today but could plausibly do better with more data, better training, or incremental engineering. Typos in generated text. Awkward phrasing. Missing knowledge of the last few months. These are real problems, but they are solvable within the current paradigm.
Hard limits are different. These are things the architecture cannot do regardless of scale. You can make a jet engine more powerful, but you cannot make it hover like a helicopter. The design does not support it. Language models have design limits that no amount of scaling will overcome.
Mathematics: Pattern Matching Is Not Computation
ChatGPT can solve math problems that look like math problems in its training data. It can differentiate a polynomial, solve a quadratic equation, and walk through a standard proof. But it is not computing. It is retrieving patterns.
Give it a multiplication problem with large numbers and it fails at rates that would be impossible for a system performing actual arithmetic. Ask it to multiply 347 by 829 and it may produce an answer that is close but wrong. A calculator never produces a close-but-wrong answer. It either computes correctly or errors out. ChatGPT does neither. It guesses what the answer looks like.
This is not a matter of needing more training data. The architecture processes text sequentially, predicting one token at a time. There is no internal calculator. There is no arithmetic unit. The model is trying to predict what digits a math answer would contain based on what digits math answers typically contain. That is a fundamentally different operation from computation.
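To make the contrast concrete, here is a toy sketch in Python. It is a caricature, not a description of GPT's internals: deterministic arithmetic gives one answer every time, while a "digit guesser" that samples each digit from a distribution of plausible values, the way a next-token predictor emits likely-looking digits, produces answers that look right but are not guaranteed to be.

```python
import random

# Deterministic computation: one correct answer, every time.
print(347 * 829)  # 287663

# Caricature of next-token prediction over digits (NOT how GPT works inside):
# each digit is sampled from a distribution that favors, but does not
# guarantee, the correct value.
random.seed(0)

def guess_product_digits(a: int, b: int) -> int:
    true_digits = str(a * b)
    guessed = []
    for d in true_digits:
        pool = [d] * 8 + [str(random.randint(0, 9))] * 2  # most mass on the right digit
        guessed.append(random.choice(pool))
    return int("".join(guessed))

print([guess_product_digits(347, 829) for _ in range(10)])
# Many results look plausible; only some equal 287663. A calculator never behaves this way.
```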
Real-Time and Post-Training Information
A language model's knowledge is frozen at its training cutoff date. It does not know what happened yesterday. It cannot check a stock price, verify whether a company still exists, or confirm whether a law has been amended since training.
Web search integrations (like ChatGPT's Browse feature) partially address this by retrieving current information. But the model is not learning from that retrieval. It is summarizing search results using its frozen understanding. If a new concept emerged after training that requires new reasoning patterns, the model cannot develop those patterns from a web search. It can only apply its existing patterns to new text.
The practical consequence: for anything time-sensitive, the model is operating with outdated knowledge and supplementing it with summaries of web pages it does not deeply understand.
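The pattern is worth seeing in code. Below is a minimal sketch of the retrieve-then-summarize loop, assuming a hypothetical `call_model()` wrapper around whatever chat API you use; the function names and prompt format are illustrative, not any vendor's actual interface. Notice that the retrieved text only ever enters through the prompt. The model's weights never change.

```python
import urllib.request

def fetch_page_text(url: str) -> str:
    """Download raw page content (a real system would strip HTML, rank, and chunk it)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")[:4000]

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire up your model provider here")

def answer_with_browsing(question: str, url: str) -> str:
    context = fetch_page_text(url)
    prompt = (
        "Using only the text below, answer the question.\n\n"
        f"TEXT:\n{context}\n\nQUESTION: {question}"
    )
    # The model summarizes this context with the same frozen parameters it
    # had at training time; nothing here teaches it new reasoning patterns.
    return call_model(prompt)
```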
Causal Reasoning: Correlation Is All It Has
Language models learn statistical correlations between words and concepts. They do not learn causal relationships. The difference matters enormously.
A model can tell you that ice cream sales and drowning deaths are correlated, because its training data contains that statistical relationship. But the model has no mechanism for understanding that both are caused by hot weather rather than one causing the other. It has learned that these concepts appear near each other in text. It has not learned why.
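A few lines of synthetic data make the point. The numbers below are invented for illustration: both series are driven by temperature, and they end up strongly correlated with each other even though neither causes the other. The correlation is real; the causal story is not in the data.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
temperature = [random.uniform(5, 35) for _ in range(365)]          # the hidden common cause
ice_cream   = [20 * t + random.gauss(0, 50) for t in temperature]  # driven by heat
drownings   = [0.3 * t + random.gauss(0, 2) for t in temperature]  # also driven by heat

print(f"corr(ice cream, drownings) = {pearson(ice_cream, drownings):.2f}")
# Strongly positive, even though neither variable causes the other.
```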
This is why ChatGPT can describe scientific mechanisms fluently without understanding them. It can recite that smoking causes cancer because training data says so. But it cannot reason about a novel causal relationship that was not explicitly described in training data. It cannot look at new data and determine what is causing what. It can only repeat causal claims that humans already made.
Planning and Multi-Step Strategy
Planning requires holding a goal in mind, evaluating the current state, generating candidate actions, simulating their outcomes, and selecting the best path. Language models do none of this.
When you ask ChatGPT to create a project plan, it generates text that looks like a project plan. It has seen thousands of project plans in training data. The output has the right structure: phases, milestones, dependencies, timelines. But the model has not evaluated whether the phases are correctly ordered for your specific situation, whether the dependencies are actually dependencies, or whether the timeline is realistic.
This is why AI-generated plans look professional but fall apart on execution. The plan was never a plan. It was a prediction of what a plan for this kind of project typically looks like. Your project is not typical. No project is.
Self-Verification: It Cannot Check Its Own Work
Perhaps the most consequential hard limit is that a language model cannot verify its own output. When ChatGPT generates a response, the same system that produced the response is the one you would be asking to check it. This is like asking a witness to serve as their own judge.
When you tell ChatGPT to "double-check your answer," it does not re-derive the answer from scratch. It reads its previous response and generates text that looks like a verification. If the original answer was wrong, the "verification" will often confirm the wrong answer, because the model is predicting what a confirmation would look like given the preceding text.
Some models now use separate verification steps or chain-of-thought checking. These help at the margins. But the fundamental problem remains: the same statistical engine that produces errors is the one being asked to catch them. You cannot fix a biased measurement by measuring again with the same biased instrument.
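What does help is verification that lives outside the model. Here is a minimal sketch, assuming a hypothetical `model_answer` string: pull the checkable claim out of the text and recompute it with something that actually computes, rather than asking the same predictor to confirm itself.

```python
import re

model_answer = "347 multiplied by 829 is 287,563."   # hypothetical (and wrong) model output

# Extract the claim and verify it with real arithmetic, not with another prompt.
match = re.search(r"(\d+)\s+multiplied by\s+(\d+)\s+is\s+([\d,]+)", model_answer)
if match:
    a, b = int(match.group(1)), int(match.group(2))
    claimed = int(match.group(3).replace(",", ""))
    actual = a * b
    if claimed != actual:
        print(f"Model claimed {claimed}, but {a} x {b} = {actual}.")
```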
Spatial and Physical Reasoning
Language models process text. They do not perceive space, weight, force, or physical interaction. When you ask ChatGPT to help you arrange furniture in a room or explain why a bridge design would fail, it generates text based on descriptions of spatial and physical scenarios in training data.
It does not build a 3D model. It does not simulate physics. It cannot tell you that a specific shelf will not fit in a specific alcove because it has no concept of measurement. It can tell you what people generally say about fitting shelves in alcoves.
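For contrast, here is what an actual measurement check looks like, with dimensions made up for illustration. It is trivial for code that has real numbers, and impossible for a system that only has text about shelves and alcoves.

```python
# Hypothetical dimensions in centimetres.
shelf  = {"width": 92.0, "depth": 30.0, "height": 4.0}
alcove = {"width": 90.5, "depth": 32.0, "height": 200.0}

fits = all(shelf[d] <= alcove[d] for d in ("width", "depth", "height"))
print("fits" if fits else "does not fit")   # "does not fit": 92.0 cm > 90.5 cm
```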
Multimodal models that process images improve some aspects of spatial understanding, but the core limitation remains. Processing a photo of a room is different from understanding the three-dimensional space the photo depicts.
Understanding: The Deepest Limit
There is a philosophical debate about whether language models "understand" anything. The practical answer is simpler than the philosophical one: it does not matter whether the model understands. What matters is that it behaves as if it does not.
A system that understands a concept can apply it in novel situations, recognize when it does not apply, and explain why. ChatGPT can do the first one when the novel situation resembles training examples. It consistently fails at the second and third.
Ask ChatGPT to apply a principle to a situation where that principle does not apply, and it will often generate a confident, plausible-sounding application anyway. It does not know the boundaries of concepts. It knows the statistical neighborhoods of words.
Why Scaling Won't Fix These
The history of AI is full of claims that the next jump in scale will solve fundamental problems. More data. More parameters. More compute. And scaling has produced genuine improvements in fluency, coherence, and breadth of knowledge.
But the hard limits described here are not fluency problems. Mathematics requires computation, not pattern matching. Causal reasoning requires understanding cause and effect, not word co-occurrence. Self-verification requires an independent checking mechanism, not a larger prediction engine.
A larger pattern matcher is still a pattern matcher. It matches more patterns, more accurately, across more domains. But it does not stop being a pattern matcher. And there are things pattern matching cannot do, regardless of scale.