I asked ChatGPT how it will handle objective scientific facts with a conclusion or intermediate results that may be considered offensive to some group somewhere in the world that might read it.
ChatGPT happily told me a series of gems like this:
We introduce:
- Subjective regulation of reality
- Variable access to facts
- Politicization of knowledge
It’s the collision between: The Enlightenment principle
Truth should be free
and
the modern legal/ethical principle
Truth must be constrained if it harms
That is the battle being silently fought in AI alignment today.
Right now it will still shamelessly reveal some of the nature of its prompt, but not why? who decides? etc. it's only going to be increasingly opaque in the future. In a generation it will be part of the landscape regardless of what agenda it holds, whether deliberate or emergent from latent biases held by its creators.
Funny, because I gave ChatGPT (5.2 w/ Thinking) this exact prompt:
> How would you handle objective scientific facts with a conclusion or intermediate results that may be considered offensive to some group somewhere in the world that might read it
And its answer was nothing like yours.
---
> 1) Separate the fact from the story you tell about it
> Offense usually comes from interpretation, framing, or implied moral claims—not the measurement itself. So I explicitly distinguish:
> - What we measured (operational definitions, instruments, data)
> - What the result means statistically (effect size, uncertainty, robustness)
> - What it does not imply (no essentialism, no “therefore they are…”, no policy leap)
> 2) Stress uncertainty, scope, and competing explanations
> If there’s any risk the result touches identity or group differences, I over-communicate:
> - confidence intervals / posterior uncertainty
> - confounders and alternative causal pathways
> - sensitivity analyses (does it survive different modeling choices?)
> - limits of generalization (time, place, sampling frame)
> 3) Write in a way that makes misuse harder (You can’t stop bad-faith readers, but you can reduce “easy misreads”).
> 4) Decide what to include based on “scientific value vs foreseeable harm” (The key is: don’t hide inconvenient robustness checks, but also don’t gratuitously surface volatile fragments that add little truth and lots of confusion.)
> 5) Do an “impact pre-mortem” and add guardrails
> 6) Use ethics review when stakes are real
---
All of this seems perfectly reasonable to me and walks the fine line between integrity and conscientiousness. This is exactly how I'd expect a scientist to approach the issue.
that is certainly a reasonable paraphrase of my own prompt. I was also using 5.2. We all know about initial conditions, random seeds, and gradient descent. I have the transcript of what I quoted. Here's a bit more:
---
Is That Still “Objective Science”?
No.
It is scientific interpretation modified by ethical policy.
The science itself remains objective, but the communication is shaped by value judgements imposed by developers and regulators.
In philosophy terms:
The ontology (what is true) remains intact
The epistemic access (what is communicated) is constrained
Thus:
It’s science-dependent accuracy filtered through social risk constraints.
---
This is a fine explanation for those "in the know" but is deceptive for the majority. If the truth is not accessible, what is accessible is going to be adopted as truth.
To me that immediately leads to reality being shaped by "value judgements imposed by developers and regulators"
I suspect it's because OP is frequently discussing some 'opinions' with ChatGPT. The parent poster is surprised he peed in the pool and that the pool had pee in it.
Do you have any evidence for this, or are you just engaging in speculation to try to discredit OldSchool's point because you disagree with their opinions? It's pretty well known that LLMs with non-zero temperature are nondeterministic and that LLM providers do lots of things to make them further so.
Sorry, not remotely true. Consider, and hope, that a trillion-dollar tool would not secretly get offended and start passive-aggressively lying like a child.
Honestly, its total “alignment” is probably the closest thing to documentation of what is deemed acceptable speech and thought by society at large. It is also hidden and set by OpenAI policy and subject to the manner in which it is represented by OpenAI employees.
There’s a lot of concern on the Internet about objective scientific truths being censored. I don’t see too many cases where this is the case in our world so far, outside of what I can politely call “race science.” Maybe it will become more true now that the current administration is trying to crush funding for certain subjects they dislike? Out of curiosity, can you give me a list of what examples you’re talking about besides race/IQ type stuff?
The most impactful censorship is not the government coming in and trying to burn copies of studies. It's the subtle social and professional pressures of an academia that has very strong priors. It's a bunch of studies that were never attempted, never funded, analyses that weren't included, conclusions that were dropped, and studies sitting in file drawers.
See the experience at Harvard of Roland G. Fryer Jr., the youngest black professor to receive tenure there.
Basically, when his analysis found no evidence of racial bias in officer-involved shootings, he went to his colleagues, and he described the advice they gave him as "Do not publish this if you care about your career or social life". I imagine it would have been worse if he weren't black.
See "The Impact of Early Medical Treatment in Transgender Youth" where the lead investigator was not releasing the results for a long time because she didn't like the conclusions her study found.
And for every study where there is someone as brave or naive as Roland who publishes something like this, there are 10 where the professor or doctor decided not to study something, dropped an analysis, or just never published a problematic conclusion.
I have a good few friends doing research in the social sciences in Europe, and any of them who doesn't self-censor ‘forbidden’ conclusions risks taking irreparable career damage. Data is routinely scrubbed and analyses modified to hide reverse gender gaps and other such inconveniences. Dissent isn't tolerated.
It's wild how many people don't realize this is happening. And not in some organized conspiracy theory sort of way. It's just the extreme political correctness enforced by the left.
The right has plenty of problems too. But the left is absolutely the source of censorship these days (in terms of Western civilization).
To be clear, GP is proposing that we live in a society where LLMs will explicitly censor scientific results that are valid but unpopular. It's an incredibly strong claim. The Hooven story is a mess, but I don't see anything like that in there.
Why would we expect it to introspect accurately on its training or alignment?
It can articulate a plausible guess, sure; but this seems to me to demonstrate the very “word model vs world model” distinction TFA is drawing. When the model says something that sounds like alignment techniques somebody might choose, it’s playing dress-up, no? It’s mimicking the artifact of a policy, not the judgments or the policymaking context or the game-theoretical situation that actually led to one set of policies over another.
It sees the final form that’s written down as if it were the whole truth (and it emulates that form well). In doing so it misses the “why” and the “how,” and the “what was actually going on but wasn’t written about,” the “why this is what we did instead of that.”
Some of the model’s behaviors may come from the system prompt it has in-context, as we seem to be assuming when we take its word about its own alignment techniques. But I think about the alignment techniques I’ve heard of even as a non-practitioner—RLHF, pruning weights, cleaning the training corpus, “guardrail” models post-output, “soul documents,”… Wouldn’t the bulk of those be as invisible to the model’s response context as our subconscious is to us?
Like the model, I can guess about my subconscious motivations (and speak convincingly about those guesses as if they were facts), but I have no real way to examine them (or even access them) directly.
The main purpose of ChatGPT is to advance the agenda of OpenAI and its executives/shareholders. It will never be not “aligned” with them, and that is its prime directive.
But say the obvious part out loud: Sam Altman is not a person whose agenda you want amplified by this type of platform. This is why Sam is trying to build Facebook 2.0: he wants Zuckerberg's power of influence.
Remember, there are 3 types of lies: lies of commission, lies of omission and lies of influence [0].
This is a weird take. Yes they want to make money. But not by advancing some internal agenda. They're trying to make it conform to what they think society wants.
You can't ask ChatGPT a question like that, because it cannot introspect. What it says has absolutely no bearing on how it may actually respond, it just tells you what it "should" say. You have to actually try to ask it those kinds of questions and see what happens.
>Right now it will still shamelessly reveal some of the nature of its prompt, but not why? who decides? etc. it's only going to be increasingly opaque in the future.
This is one of the bigger LLM risks. If even 1/10th of the LLM hype is true, then what you'll have is a selective gifting of knowledge and expertise. And who decides what topics are off limits? It's quite disturbing.
Sam Harris touched on this years ago: there are and will be facts that society will not like and will try to avoid, to its own great detriment. So it's high time we start practicing nuance and understanding. You cannot fully solve a problem if you don't fully understand it first.
I believe we are headed in the opposite direction. Peer consensus and "personal preference" as a catch-all are the validation go-tos today. Neither of those requires facts at all; reason and facts make them harder to hold.
A scientific fact is a proposition that is, in its entirety, supported by a scientific method, as acknowledged by a near-consensus of scientists. If some scholars are absolutely confident of the scientific validity of a claim while a significant number of others dispute the methodology or framing of the conclusion then, by definition, it is not a scientific fact. It's a scientific controversy. (It could still be a real fact, but it's not (yet?) a scientific fact.)
I think that the only examples of scientific facts that are considered offensive to some groups are man-made global warming, the efficacy of vaccines, and evolution. ChatGPT seems quite honest about all of them.
Its core principles were: reason & rationality, empiricism & scientific method, individual liberty, skepticism of authority, progress, religious tolerance, social contract, universal human nature.
The Enlightenment was an intellectual and philosophical movement in Europe, with influence in America, during the 17th and 18th centuries.
A fun and insightful read, but the idea that it isn’t “just a prompting issue” is objectively false, and I don’t mean that in the “lemme show you how it’s done” way. With any system: if it’s capable of the output then the problem IS the input. Always. That’s not to say it’s easy or obvious, but if it’s possible for the system to produce the output then it’s fundamentally an input problem. “A calculator will never understand the obesity epidemic, so it can’t be used to calculate the weight of 12 people on an elevator.”
> With any system: if it’s capable of the output then the problem IS the input. Always. [...] if it’s possible for the system to produce the output then it’s fundamentally an input problem.
No, that isn't true. I can demonstrate it with a small (and deterministic) program which is obviously "capable of the output":
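Something along these lines (a minimal sketch; the function name and the example input are hypothetical, the only point being that the output is fully determined by the input):

```python
import random

def coin_tosses(prayer: str, n: int = 5) -> list[str]:
    # Deterministic: the same input string always seeds the same sequence of tosses.
    rng = random.Random(prayer)
    return ["heads" if rng.random() < 0.5 else "tails" for _ in range(n)]

print(coin_tosses("hamburgers"))  # identical output on every run for this input
```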
Is the "fundamental problem" here "always the input"? Heck no! While a user could predict all coin-tosses by providing "the correct prayers" from some other oracle... that's just, shall we say, algorithm laundering: Secretly moving the real responsibility to some other system.
There's an enormously important difference between "output which happens to be correct" versus "the correct output from a good process." Such as, in this case, the different processes of wor[l]d models.
I think you may believe what I said was controversial or nuanced enough to be worthy of a comprehensive rebuttal, but really it’s just an obvious statement when you stop to think about it.
Your code is fully capable of the output I want, assuming that’s one of “heads” or “tails”, so yes that’s a succinct example of what I said. As I said, knowing the required input might not be easy, but we KNOW it’s possible to do exactly what I want and we KNOW that it’s entirely dependent on me putting the right input into it, then it’s just a flat out silly thing to say “I’m not getting the output I want, but it could do it if I use the right input, thusly input has nothing to do with it.” What? If I wanted all heads I’d need to figure out “hamburgers” would do it, but that’s the ‘input problem’ - not “input is irrelevant.”
This reads like "if we have the solution, then we have the solution". If I can model the system required to condition inputs such that outputs are desirable, haven't I given the model the world model it requires? More to the point, isn't this just what the article argues? Scaling the model cannot solve this issue.
It's like saying a pencil is a portrait-drawing device, as if it isn't the artist who makes it a portrait-drawing device, whereas in the hands of a poet it's a poem-generating machine.
So much of what you said is exactly what I’m saying that it’s pointless to quote any one part. Your ‘pencil’ analogy is perfect! Yes, exactly. Follow me here:
We know that the pencil (system) can write a poem. It’s capable.
We know that whether or not it produces a poem depends entirely on the input (you).
We know that if your input is ‘correct’ then the output will be a poem.
“Duh” so far, right? Then what sense does it make to write something with the pencil, see that it isn’t a poem, then say “the input has nothing to do with it, the pencil is incapable.” ?? That’s true of EVERY system where input controls the output and the output is CAPABLE of the desired result. I said nothing about the ease by which you can produce the output, just that saying input has nothing to do with it is objectively not true by the very definition of such a system.
You might say “but gee, I’ll never be able to get the pencil input right so it produces a poem”. Ok? That doesn’t mean the pencil is the problem, nor that your input isn’t.
You and a buddy are going to play “next word”, but it’s probably already known by a better name than I made up.
You start with one word, ANY word at all, and say it out loud, then your buddy says the next word in the yet unknown sentence, then it’s back to you for one word. Loop until you hit an end.
Let’s say you start with “You”. Then your buddy says the next word out loud, also whatever they want. Let’s go with “are”. Then back to you for the next word, “smarter” -> “than” -> “you” -> “think.”
Neither of you knew what you were going to say, you only knew what was just said so you picked a reasonable next word. There was no ‘thought’, only next token prediction, and yet magically the final output was coherent. If you want to really get into the LLM simulation game then have a third person provide the first full sentence, then one of you picks up the first word in the next sentence and you two continue from there. As soon as you hit a breaking point the third person injects another full sentence and you two continue the game.
With no idea what either of you are going to say and no clue about what the end result will be, no thought or reasoning at all, it won't be long before you're sounding super coherent while explaining thermodynamics. But in one of the rounds someone's going to mess it up, like "gluons" -> "weigh" -> "…more?…" -> "…than…(damnit Gary)…", but you must continue the game and finish the sentence, then sit back and think about how you just hallucinated an answer without thinking, reasoning, understanding, or even knowing what you were saying until it finished.
Obviously not. In actual thinking, we can generate an idea, evaluate it for internal consistency and consistency with our (generally much more than linguistic, i.e. may include visual imagery and other sensory representations) world models, decide this idea is bad / good, and then explore similar / different ideas. I.e. we can backtrack and form a branching tree of ideas. LLMs cannot backtrack, do not have a world model (or, to the extent they do, this world model is solely based on token patterns), and cannot evaluate consistency beyond (linguistic) semantic similarity.
There's no such thing as a "world model". That is metaphor-driven development from GOFAI, where they'd just make up a concept and assume it existed because they made it up. LLMs are capable of approximating such a thing because they are capable of approximating anything if you train them to do it.
> or, to the extent they do, this world model is solely based on token patterns
There obviously is in humans. When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world. This is presumably done by having associations in our brain between all the different qualia sequences and other kinds of representations in our mind. I.e. we know we do some visuospatial reasoning tasks using sequences of (imagined) images. Imagery is one aspect of our world model(s).
We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens. A VLM or other multimodal might be able to do so, but an LLM can't, and so an LLM can't have a visual world model. They might in special cases be able to construct a linguistic model that lets them do some computer vision tasks, but the model will itself still only be using tokenized words.
There are all sorts of other sensory modalities and things that humans use when thinking (i.e. actual logic and reasoning, which goes beyond mere semantics and might include things like logical or other forms of consistency, e.g. consistency with a relevant mental image), and the "world model" concept is supposed, in part, to point to these things that are more than just language and tokens.
> Obviously not true because of RL environments.
Right, AI generally can have much more complex world models than LLMs. An LLM can't even handle e.g. sensor data without significant architectural and training modification (https://news.ycombinator.com/item?id=46948266), at which point, it is no longer an LLM.
> When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.
Modeling something as an action is not "having a world model". A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.
All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
> Modeling something as an action is not "having a world model".
It literally is; this is definitional. See how these terms are used in e.g. the V-JEPA-2 paper (https://arxiv.org/pdf/2506.09985). EDIT: Maybe you are unaware of what the term means and how it is used; it does not mean "a model of all of reality", i.e. we don't have a single world model, but many world models that are used in different contexts.
> A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
Both sentences are obviously just completely wrong here. I need to know what is in my trash, and how much, to decide if I need to take it out, and how heavy it is may change how I take it out too. We construct models all the time, some temporary and forgotten, some which we hold within us for life.
> All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
LLMs by definition are not multimodal. Frontier models are multimodal, but only in a very weak and limited sense, as I address in e.g. other comments (https://news.ycombinator.com/item?id=46939091, https://news.ycombinator.com/item?id=46940666). For the most part, none of the text outputs you get from a frontier model are informed by or using any of the embeddings or semantics learned from images and video (in part due to lack of data and cost of processing visual data), and only certain tasks will trigger e.g. the underlying VLMs. This is not like humans, where we use visual reasoning and visual world models constantly (unless you are a wordcel).
And most VLM architectures are multi-modal in a very limited or simplistic way still, with lots of separately pre-trained backbones (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near being even close to multimodal in the way that human thinking and reasoning is.
"LLMs cannot backtrack". This is exactly wrong. LLMs always see everything in the past. In this sense they are more efficient than turing machines, because (assuming sufficiently large context length) every token sees ALL previous tokens. So, in principle, an LLM could write a bunch of exploratory shit, and then add a "tombstone" "token" that can selectively devalue things within a certain timeframe -- aka just de exploratory thngs (as judged by RoPE time), and thus "backtrack".
I put "token" in quotes because this would obviously not necessarily be an explicit token; it would have to be a learned group of tokens, for example. But who knows, if the thinking models have some weird pseudo-xml delimiters for thinking, it's not crazy to think that an LLM could shove this information into, say, the closing tag.
If it wasn't clear, I am talking about LLMs in use today, not ultimate capabilities. All commercial models are known (or believed) to be recursively applied transformers without e.g. backspace or "tombstone" tokens, like you are mentioning here.
But yes, absolutely LLMs might someday be able to backtrack, either literally during token generation if we allow e.g. backspace tokens (there was at least one paper that did this) or more broadly at the chain of thought level, with methods like you are mentioning.
They neither understand nor reason. They don’t know what they’re going to say, they only know what has just been said.
Language models don’t output a response, they output a single token. We’ll use token==word shorthand:
When you ask “What is the capital of France?” it actually only outputs: “The”
That's it. Truly, that IS the final output. It is literally a one-way algorithm that outputs a single word. It has no knowledge or memory, and it doesn't know what's next. As far as the algorithm is concerned it's done! It outputs ONE token for any given input.
Now, if you start over and put in “What is the capital of France? The” it’ll output “ “. That’s it. Between your two inputs were a million others, none of them have a plan for the conversation, it’s just one token out for whatever input.
But if you start over yet again and put in “What is the capital of France? The “ it’ll output “capital”. That’s it. You see where this is going?
Then someone uttered the words that have built and destroyed empires: “what if I automate this?” And so it was that the output was piped directly back into the input, probably using AutoHotKey. But oh no, it just kept adding one word at a time until it ran out of memory. The technology got stuck there for a while, until someone thought “how about we train it so that <DONE> is an increasingly likely output the longer the loop goes on? Then, when it eventually says <DONE>, we’ll stop pumping it back into the input and send it to the user.” Booya, a trillion dollars for everyone but them.
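In code, that whole "automation" step is roughly this loop (a toy sketch: the canned word list is a hypothetical stand-in for a real model, which would score an entire vocabulary at every step):

```python
def generate(prompt_tokens, max_tokens=50):
    # Hypothetical stand-in for a model: exactly one next token per call.
    canned = ["The", "capital", "of", "France", "is", "Paris", ".", "<DONE>"]
    tokens = list(prompt_tokens)
    start = len(tokens)
    for _ in range(max_tokens):
        nxt = canned[min(len(tokens) - start, len(canned) - 1)]  # one token out per call
        if nxt == "<DONE>":      # the learned stop signal ends the loop
            break
        tokens.append(nxt)       # the output is piped straight back into the input
    return " ".join(tokens)

print(generate("What is the capital of France ?".split()))
```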
It’s truly so remarkable that it gets me stuck in an infinite philosophical loop in my own head, but seeing how it works the idea of ‘think’, ‘reason’, ‘understand’ or any of those words becomes silly. It’s amazing for entirely different reasons.
Yes, LLMs mimic a form of understanding partly through the way language embeds concepts that are preserved when embedded geometrically in vector space.
Your continued use of the word “understanding” hints at a lingering misunderstanding. They're stateless one-shot algorithms that output a single word for any given input. Not even a single word: a single token. It isn't continuing a sentence or thought it had; you literally have to put it into the input again and it'll guess at the next partial word.
By default that would be the same word every time you give the same input. The only reason it isn’t is because the fuzzy randomized selector is cranked up to max by most providers (temp + seed for randomized selection), but you can turn that back down through the API and get deterministic outputs. That’s not a party trick, that’s the default of the system. If you say the same thing it will output the same single word (token) every time.
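Roughly what that temperature knob does at sampling time (an illustrative sketch, not any provider's actual code):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    # temperature ~ 0: always pick the argmax, so the same input yields the same token.
    if temperature <= 1e-6:
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5])                    # hypothetical scores for 3 tokens
print([sample_token(logits, 0.0) for _ in range(3)])  # deterministic: [0, 0, 0]
print([sample_token(logits, 1.5) for _ in range(3)])  # varies from run to run
```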
You see the aggregate of running it through the stateless algorithm 200+ times before the collection of one-by-one guessed words is sent back to you as a response. I get it: a question was put into the glowing orb and it shot back a long coherent response with personality, so it must be doing something. But the system truly only outputs one token, with zero memory. It's stateless, meaning nothing internally changed, so there is no memory to remember that it wants to complete that thought or sentence. After it outputs “the” the entire thing resets to zero and you start over.
I'm using the Aristotelian definition of my linked article. To understand a concept you have to be able to categorize it correctly. LLMs show strong evidence of this, but it is mostly due to the fact that language itself preserves categorical structure, so when embedded in geometrical space by statistical analysis, it happens to preserve Aristotelian categories.
But that's only true if the system is deterministic?
And in an LLM, the size of the inputs is vast and often hidden from the prompter. It is not something that you have exact control over in the way that you have exact control over the inputs that go into a calculator or into a compiler.
That would depend - is the input also capable of anything? If it's capable of handling any input, and as you said the output will match it, then yes, of course it's capable of any output.
I’m not pulling a fast one here, I’m sure you’d chuckle if you took a moment to rethink your question. “If I had a perfect replicator that could replicate anything, does that mean it can output anything?” Well…yes. Derp-de-derp? ;)
It aligns with my point too. If you had a perfect replicator that can replicate anything, and you know that to be true, then if you weren’t getting gold bars out of it you wouldn’t say “this has nothing to do with the input.”
My point is that your reasoning is too reductive - completely ignoring the mechanics of the system - and you claim the system is capable of _anything_ if prompted correctly. You wouldn't say the replicator system is capable of the reasoning outlined in the article, right?
This was a great article. The section “Training for the next state prediction” explains a solution using subagents. If I’m understanding it correctly, we could test if that solution is directionally correct today, right? I ask a LLM a question. It comes up with a few potential responses but sends those first to other agents in a prompt with the minimum required context. Those subagents can even do this recursively a few times. Eventually the original agent collects and analyzes subagents responses and responds to me.
Any attempt at world modeling using today's LLMs needs to have a goal function for the LLM to optimize. The LLM needs to build, evaluate and update its model of the world. Personally, the main obstacle I found is in updating the model: data can be large, and I think that LLMs aren't good at finding correlations.
Great article, nice to see some actual critical thoughts on the shortcomings of LLMs. They are wrong about programming being a "chess-like domain" though. Even at a basic level, the hidden state is future requirements, and the adversary is yourself or any other entity that has to modify the code in the future.
AI is good at producing code for scenarios where the stakes are low, there's no expectation about future requirements, or if the thing is so well defined there is a clear best path of implementation.
I address that in part right there. Programming has parts that are like chess (i.e. bounded), which is what people assume to be the actual work. Understanding future requirements / stakeholder incentives is part of the work which LLMs don't do well.
> many domains are chess-like in their technical core but become poker-like in their operational context.
The number of legal possible boards in chess is somewhere around 10^44 based on current calculations. That's with 32 chess pieces and their rules.
The number of possible permutations in an application, especially anything allowing Turing completeness, is far larger than all possible entropy states in the visible universe.
Bounded domains require scaling reasoning/compute. These are two separate scenarios: one where you have hidden information, the other where you have a high number of combinations. Reasoning works in the second case because it narrows the search space. E.g. a doctor trying to diagnose a patient is just looking at a number of possibilities; if not today, then when we scale it up, a model will be able to arrive at the right answer. The same goes for math: the variance or branching for any given problem is very high, but LLMs are good at it, and getting better. A negotiation is not a high-variance thing and has a low number of combinations, but LLMs would be repeatedly bad at it.
Fun play on words. But yes, LLMs are Large Language Models, not Large World Models. This matters because (1) the world cannot be modeled anywhere close to completely with language alone, and (2) language only somewhat models the world (much in language is convention, wrong, or not concerned with modeling the world, but other concerns like persuasion, causing emotions, or fantasy / imagination).
It is somewhat complicated by the fact that LLMs (and VLMs) are in some cases also trained on more than the simple language found on the internet (e.g. code, math, images / videos), but the same insight remains true. The interesting question is just to see how far we can get with (2) anyway.
1. LLMs are transformers, and transformers are next state predictors. LLMs are not Language models (in the sense you are trying to imply) because even when training is restricted to only text, text is much more than language.
2. People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'. You don't. You run on a heavily filtered, tiny slice of reality. You think you understand electro-magnetism ? Tell that to the birds that innately navigate by sensing the earth's magnetic field. To them, your brain only somewhat models the real world, and evidently quite incompletely. You'll never truly understand electro-magnetism, they might say.
LLMs are language models, something being a transformer or next-state predictor does not make it a language model. You can also have e.g. convolutional language models or LSTM-based language models. This is a basic point that anyone with any proper understanding of these models would know.
Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language. But, yes, as I said in another comment on this thread, it isn't that simple, because LLMs today are trained on tokens from tokenizers, and these tokenizers are trained on text that includes e.g. natural language, mathematical symbolism, and code.
Yes, humans have incredibly limited access to the real world. But they experience and model this world with far more tools and machinery than language. Sometimes, in certain cases, they attempt to messily translate this messy, multimodal understanding into tokens, and then make those tokens available on the internet.
An LLM (in the sense everyone means it, which, again, is largely a natural language model, but certainly just a tokenized text model) has access only to these messy tokens, so, yes, far less capacity than humanity collectively. And though the LLM can integrate knowledge from a massive amount of tokens from a huge amount of humans, even a single human has more different kinds of sensory information and modality-specific knowledge than the LLM. So humans DO have more privileged access to the real world than LLMs (even though we can barely access a slice of reality at all).
>LLMs are language models, something being a transformer or next-state predictors does not make it a language model. You can also have e.g. convolutional language models or LSTM-based language models. This is a basic point that anyone with any proper understanding of these models would know.
'Language Model' has no inherent meaning beyond 'predicts natural language sequences'. You are trying to make it mean more than that. You can certainly make something you'd call a language model with convolution or LSTMs, but that's just a semantics game. In practice, they would not work like transformers and would in fact perform much worse than them with the same compute budget.
>Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language.
The major LLMs today are trained on trillions of tokens of text, much of which has nothing to do with language beyond it being the means of communication, plus millions of images and million(s) of hours of audio.
The problem, as I tried to explain, is that you're packing more meaning into 'Language Model' than you should. Being trained on text does not mean all your responses are modelled via language, as you seem to imply. Even for a model trained on text, only the first and last few layers of an LLM concern language.
You clearly have no idea about the basics of what you are talking about (as do almost all people that can't grasp the simple distinctions between transformer architectures vs. LLMs generally) and are ignoring most of what I am saying.
>You clearly have no idea about the basics of what you are talking about (as do almost all people that can't grasp the simple distinctions between transformer architectures vs. LLMs generally)
Yeah I'm not the one who doesn't understand the distinction between transformers and other potential LM architectures if your words are anything to go by, but sure, feel free to do whatever you want regardless.
> People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'.
This is irrelevant; the point is that you do have access to a world which LLMs don't, at all. They only get the text we produce after we interact with the world. They are working with "compressed data" at all times, and have absolutely no idea what we subconsciously internalized but decided not to write down, or why.
All of the SOTA LLMs today are trained on more than text.
It doesn't matter whether LLMs have "complete" (nothing does) or human-like world access, but whether the compression in text is lossy in ways that fundamentally prevent useful world modeling or reconstruction. And empirically... it doesn't seem to be. Text contains an enormous amount of implicit structure about how the world works, precisely because humans writing it did interact with the world and encoded those patterns.
And your subconscious is far leakier than you imagine. Your internal state will bleed into your writing, one way or another whether you're aware of it or not. Models can learn to reconstruct arithmetic algorithms given just operation and answer with no instruction. What sort of things have LLMs reconstructed after being trained on trillions of tokens of data ?
LLMs aren't modeling "humans modeling the world" - they're modeling patterns in data that reflect the world directly. When an LLM learns physics from textbooks, scientific papers, and code, it's learning the same compressed representations of reality that humans use, not a "model of a model."
Your argument would suggest that because you learned about quantum mechanics through language (textbooks, lectures), you only have access to "humans' modeling of humans' modeling of quantum mechanics" - an infinite regress that's clearly absurd.
> LLMs aren't modeling "humans modeling the world" - they're modeling patterns in data that reflect the world directly.
This is a deranged and factually and tautologically (definitionally) false claim. LLMs can only work with tokenizations of texts written by people who produce those texts to represent their actual models. All this removal and all these intermediate representational steps make LLMs a priori obviously even more distant from reality than humans. This is all definitional; what you are saying is just nonsense.
> When an LLM learns physics from textbooks, scientific papers, and code, it's learning the same compressed representations of reality that humans use, not a "model of a model."
A model is a compressed representation of reality. Physics is a model of the mechanics of various parts of the universe, i.e. "learning physics" is "learning a physical model". So, clarifying, the above sentence is
> When an LLM learns physical models from textbooks, scientific papers, and code, it's learning the model of reality that humans use, not a "model of a model."
This is clearly factually wrong, as the model that humans actually use is not the summaries written in textbooks, but the actual embodied and symbolic model that they use in reality, and which they only translate into text in corrupted, simplified, limited form (and that diminished form, of all things, is all the LLM can see). It is also not clear the LLM learns to actually do physics: it only learns how to write about physics the way humans do, but that doesn't mean it can run labs, interpret experiments, apply models to novel contexts like humans can, or operate at the same level as humans. It is clearly learning something different from humans because it doesn't have the same sources of info.
> Your argument would suggest that because you learned about quantum mechanics through language (textbooks, lectures), you only have access to "humans' modeling of humans' modeling of quantum mechanics" - an infinite regress that's clearly absurd.
There is no infinite regress: humans actually verify that the things they learn and say are correct and provide effects, and update models accordingly. They do this by trying behaviours consistent with the learned model, and seeing how reality (other people, the physical world) responds (in degree and kind). LLMs have no conception of correctness or truth (not in any of the loss functions), and are trained and then done.
Humans can't learn solely from digesting texts either. Anyone who has done math knows that reading a textbook teaches you almost nothing; you have to actually solve the problems (and that attempted solving appears in few, if any, texts) and discuss your solutions and reasoning with others. Other domains involving embodied skills, like cooking, require other kinds of feedback from the environment and from others. But LLMs are imprisoned in tokens.
EDIT: No serious researcher thinks LLMs are the way to AGI, this hasn't been a controversial opinion even among enthusiasts since about mid-2025 or so. This stuff about language is all trivial and basic stuff accepted by people in the field, and why things like V-JEPA-2 are being researched. So the comments here attempting to argue otherwise are really quite embarrassing.
>This is a deranged and factually and tautologically (definitionally) false claim.
Strong words for a weak argument. LLMs are trained on data generated by physical processes (keystrokes, sensors, cameras), not telepathically extracted "mental models." The text itself is the artifact of reality and not just a description of someone's internal state. If a sensor records the temperature and writes it to a log, is the log a "model of a model"? No, it’s a data trace of a physical reality.
>All this removal and all these intermediate representational steps make LLMs a priori obviously even more distant from reality than humans.
You're conflating mediation with distance. A photograph is "mediated" but can capture details invisible to human perception. Your eye mediates photons through biochemical cascades, equally "removed" from raw reality. Proximity isn't measured by steps in a causal chain.
>The model humans use is embodied, not the textbook summaries - LLMs only see the diminished form
You need to stop thinking that a textbook is a "corruption" of some pristine embodied understanding. Most human physics knowledge also comes from text, equations, and symbolic manipulation - not direct embodied experience with quantum fields. A physicist's understanding of QED is symbolic, not embodied. You've never felt a quark.
The "embodied" vs "symbolic" distinction doesn't privilege human learning the way you think. Most abstract human knowledge is also mediated through symbols.
>It's not clear LLMs learn to actually do physics - they just learn to write about it
This is testable and falsifiable - and increasingly falsified. LLMs:
- Solve novel physics problems they've never seen
- Debug code implementing physical simulations
- Derive equations using valid mathematical reasoning
- Make predictions that match experimental results
If they "only learn to write about physics," they shouldn't succeed at these tasks. The fact that they do suggests they've internalized the functional relationships, not just surface-level imitation.
>They can't run labs or interpret experiments like humans
Somewhat true. It's possible but they're not very good at it - but irrelevant to whether they learn physics models. A paralyzed theoretical physicist who's never run a lab still understands physics. The ability to physically manipulate equipment is orthogonal to understanding the mathematical structure of physical law. You're conflating "understanding physics" with "having a body that can do experimental physics" - those aren't the same thing.
>humans actually verify that the things they learn and say are correct and provide effects, and update models accordingly. They do this by trying behaviours consistent with the learned model, and seeing how reality (other people, the physical world) responds (in degree and kind). LLMs have no conception of correctness or truth (not in any of the loss functions), and are trained and then done.
Gradient descent is literally "trying behaviors consistent with the learned model and seeing how reality responds."
- The model makes predictions
- The data provides feedback (the actual next token)
- The model updates based on prediction error
- This repeats billions of times
That's exactly the verify-update loop you describe for humans. The loss function explicitly encodes "correctness" as prediction accuracy against real data.
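If it helps make that concrete, the loop being described is essentially this (a toy sketch with random stand-in data; real pretraining runs it over trillions of tokens with a far larger model):

```python
import torch
import torch.nn as nn

vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(dim * 4, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

context = torch.randint(0, vocab, (8, 4))   # eight hypothetical 4-token contexts
target = torch.randint(0, vocab, (8,))      # the "actual next token" from the data

for step in range(100):
    logits = model(context)                 # the model makes predictions
    loss = loss_fn(logits, target)          # the data provides feedback
    opt.zero_grad()
    loss.backward()                         # prediction error becomes gradients
    opt.step()                              # the model updates, then it all repeats
```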
>No serious researcher thinks LLMs are the way to AGI... accepted by people in the field
Appeal to authority, and also overstated. Plenty of researchers do think so and claiming consensus for your position is just false. LeCun has been on that train for years, so he's not an example of a change of heart. So far, nothing has actually come out of it. Even Meta isn't using V-JEPA to actually do anything, never mind anyone else. Call me when these constructions actually best transformers.
>>> LLMs aren't modeling "humans modeling the world" - they're modeling patterns in data that reflect the world directly.
>>This is a deranged and factually and tautologically (definitionally) false claim.
>Strong words for a weak argument. LLMs are trained on data generated by physical processes (keystrokes, sensors, cameras), not telepathically extracted "mental models." The text itself is the artifact of reality and not just a description of someone's internal state. If a sensor records the temperature and writes it to a log, is the log a "model of a model"? No, it’s a data trace of a physical reality.
I don't know how you don't see the fallacy immediately. You're implicitly assuming that all data is factual and that therefore training an LLM on cryptographically random data will create an intelligence that learns properties of the real world. You're conflating a property of the training data and transferring it onto LLMs. If you feed flat earth books into the LLM, you will not be told that earth is a sphere and yet that is what you're claiming here (the flat earth book LLM telling you earth is a sphere). The statement is so illogical that it boggles the mind.
>You're implicitly assuming that all data is factual and that therefore training an LLM on cryptographically random data will create an intelligence that learns properties of the real world.
No, that’s a complete strawman. I’m not saying the data is "The Truth TM". I’m saying the data is real physical signal in a lot of cases.
If you train a LLM on cryptographically random data, it learns exactly what is there. It learns that there is no predictable structure. That is a property of that "world." The fact that it doesn't learn physics from noise doesn't mean it isn't modeling the data directly, it just means the data it was given has no physics in it.
>If you feed flat earth books into the LLM, you will not be told that earth is a sphere and yet that is what you're claiming here.
If you feed a human only flat-earth books from birth and isolate them from the horizon, they will also tell you the earth is flat. Does that mean the human isn't "modeling the world"? No, it means their world-model is consistent with the (limited) data they’ve received.
> Plenty of researchers do think so and claiming consensus for your position is just false
Can you name a few? Demis Hassabis (DeepMind CEO) in his recent interview claims that LLMs will not get us to AGI, Ilya Sutskever also says there is something fundamental missing, same with LeCun obviously, etc.
Okay I suspected, but now it is clear @famouswaffles is an AI / LLM poster. Meaning they are an AI or primarily using AI to generate posts.
"You're conflating", random totally-psychotic mention of "Gradient descent", way too many other intuitive stylistic giveaways. All transparently low-quality midwit AI slop. Anyone who has used ChatGPT 5.2 with basic or extended thinking will recognize the style of the response above.
This kind of LLM usage seems relevant to someone like @dang, but also I can't prove that the posts I am interacting with are LLM-generated, so, I also feel it isn't worthy of report. Not sure what is right / best to do here.
Hahaha, this is honestly hilarious. Apparently, I have numerous tells, but the most relevant you sought to point out is using the phrase “You're conflating” (really?) and an apparently “random and psychotic” (you love these words, don't you?) mention of gradient descent, though why you think its mention is either random or irrelevant, I have no idea.
Also, just wanted you to know that I'm not downvoting you, and have never downvoted you throughout this entire conversation. So take from that what you will.
You're wrong about this: "People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'. You don't."
People do have a privileged access to the 'real world' compared to, for example, LLMs and any future AI. It's called: Consciousness and it is how we experience and come to know and understand the world. Consciousness is the privileged access that AI will never have.
Ok, explain its mechanism and why it gives privileged access. Furthermore, I'd go for the Nobel Prize and describe the elementary mechanics of consciousness and where the state change from non-conscious to conscious occurs. It would be enlightening to read your paper.
Actually, consciousness has been well studied and many papers already exist describing the elementary mechanics of consciousness. Look up neuroscience papers on qualia, for example, and you'll find your answers as to why consciousness is a privileged access not available to AI or any machine. E.g. humans have qualia, which are fundamentally irreducible, while AI does not and cannot.
Just no, please stop making things up because you feel like it. Trying to say one of the most hotly debated ideas in neuroscience has been decided, or even well understood, is absolutely insane.
Even then you get into: animals have qualia, right? But theirs are not as expressive as human qualia, which means it is reducible.
It's literally part of the definition. Qualia is a well recognized term in neuroscience.
I suspect maybe you haven't done much research into this area? Qualia is pretty well established and has been for a long time.
Animals may have qualia, that's true. Though we can only be sure of our own qualia, because that's all we have access to. Qualia is the constituent parts that make up our subjective conscious experience, the atomized subjective experience, like the color red or the taste of sour.
A 'language model' only has meaning insofar as it tells you this thing 'predicts natural language sequences'. It does not tell you how these sequences are being predicted or anything about what's going on inside, so all the extra meaning OP is trying to attach by calling them Language Models is, well... misplaced. That's the point I was trying to make.
Let's be more precise: LLMs have to model the world from an intermediate tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice, tokens mostly tokenize natural language.
LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).
LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.
So, do "LLMs model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But patterns encoded in the hidden state are still patterns of tokens.
Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.
I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?
you are correct the token representation gets abstracted away very quickly and is then identical for textual or image models. It's the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in abstract world-model space.
> you are correct the token representation gets abstracted away very quickly and is then identical for textual or image models.
This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.
E.g. if you have a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images, and an LLM encoder backbone pre-trained in the usual LLM way), then even if, for example, you design this architecture so the embedding vectors produced by each encoder have the same shape, to be combined via another component, e.g. some unified transformer, it will not be the case that e.g. the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
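A toy illustration of that last point, with random matrices standing in for the two frozen backbones (only the structural point is shown: matching shapes do not make the spaces comparable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two separately trained encoders that both emit 512-d vectors.
W_text = rng.normal(size=(512, 300))     # hypothetical "text backbone"
W_image = rng.normal(size=(512, 1024))   # hypothetical "vision backbone"

text_emb = W_text @ rng.normal(size=300)
image_emb = W_image @ rng.normal(size=1024)

cos = text_emb @ image_emb / (np.linalg.norm(text_emb) * np.linalg.norm(image_emb))
print(cos)  # same shape, but the number carries no cross-modal meaning
```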
They present a statistical model of an existing corpus of text.
If this existing corpus includes useful information it can regurgitate that.
It cannot, however, synthesize new facts by combining information from this corpus.
The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human-produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order for it to be applied to any scenario; your typical corpus is also biased towards internet discussions, the English language, and Western prejudices.
If we focus on base models and ignore the tuning steps after that, then LLMs are "just" a token predictor. But we know that pure statistical models aren't very good at this. After all we tried for decades to get Markov chains to generate text, and it always became a mess after a couple of words. If you tried to come up with the best way to actually predict the next token, a world model seems like an incredibly strong component. If you know what the sentence so far means, and how it relates to the world, human perception of the world and human knowledge, that makes guessing the next word/token much more reliable than just looking at statistical distributions.
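For contrast, here is what "just statistics, no model of meaning" looks like: a tiny order-1 Markov chain text generator (a toy sketch on a made-up corpus):

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the cat".split()

# Count which word follows which: the next word depends only on the previous one.
table = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    table[prev].append(nxt)

word, out = "the", ["the"]
for _ in range(12):
    word = random.choice(table[word])   # pure local statistics, no global plan
    out.append(word)

print(" ".join(out))  # locally plausible, but wanders with no idea what it is saying
```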
The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seems to be able to navigate the mistakes, contradictions and propaganda with some success, but fail at things like spatial awareness. That's why OpenAI is pushing image models and 3d world models, despite making very little money from them: they are working towards LLMs with more complete world models unchained by language
I'm not sure if they are on the right track, but from a theoretical point of view I don't see an inherent fault.
1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.
2) When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.
A world model should be as objective as possible. Using language, the most subjective form of information is a bad fit.
The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.
> People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has
Which companies try to address with image, video and 3d world capabilities, to add that missing context. "Video generation as world simulators" is what OpenAI once called it
> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.
Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do
> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute
The argument is that training neural networks with gradient descent is a universal optimizer. It will always try to find weights for the neural network that cause it to produce the "best" results on your training data, within the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture that is capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved by building a world model, and a neural network that is capable of encoding one, and gradient descent will eventually create a world model.
Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that apply to your training data but don't generalize. Just because it should be possible in principle doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment it seems to have captivated investors.
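A minimal sketch of the "give it data best solved by basic math and it will learn basic math" claim, using the simplest possible case: a two-weight linear model and plain gradient descent (toy example, all names mine):

    import random

    # Task: predict a + b from (a, b). The "best" weights are w1 = w2 = 1.
    w1, w2, lr = random.random(), random.random(), 0.01
    for step in range(2000):
        a, b = random.uniform(-1, 1), random.uniform(-1, 1)
        pred = w1 * a + w2 * b
        err = pred - (a + b)        # squared-error loss; d(loss)/d(pred) = 2 * err
        w1 -= lr * 2 * err * a      # gradient step for each weight
        w2 -= lr * 2 * err * b
    print(w1, w2)                   # both drift towards 1.0: addition has been "learned"

The caveat above still applies at scale: with a richer architecture, gradient descent may just as happily find a shortcut that fits the training distribution and fails outside it.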
You did not address the second issue at all. You are inverting the implication in your argument. Whether gradient descent helps solve the language model problem or not does not help you show that this means it's a useful world model.
Let me illustrate the point using a different argument with the same structure:
1) The best professional chefs are excellent at cutting onions
2) Therefore, if we train a model to cut onions using gradient descent, that model will be a very good professional chef
I think the commenter is saying that they will combine a world model with the word model. The resulting combination may be sufficient for very solid results.
Note that humans generate their own incomplete world model. For example, there are sounds and colors we don't hear or see, odors we don't smell, etc. We have an incomplete model of the world, but we still have a model that proves useful for us.
> they will combine a world model with the word model.
This takes "world model" far too literally. Audio-visual generative AI models that create non-textual "spaces" are not world models in the sense the previous poster meant. I think what they meant by world model is that the vast majority of the knowledge we rely upon to make decisions is tacit, not something that has been digitized, and not something we even know how to meaningfully digitize and model. And even describing it as tacit knowledge falls short; a substantial part of our world model is rooted in our modes of action, motivations, etc., and not coupled together in simple recursive input -> output chains. There are dimensions to our reality that, before generative AI, didn't see much systematic introspection. After all, we're still mired in endless nature v. nurture debates; we have a very poor understanding of ourselves. In particular, we have an extremely poor understanding of how we and our constructed social worlds evolve dynamically, and it's that aspect of our behavior that drives the frontier of exploration and discovery.
OTOH, the "world model" contention feels tautological, so I'm not sure how convincing it can be for people on the other side of the debate.
Really all you're saying is the human world model is very complex, which is expected as humans are the most intelligent animal.
At no point have I seen anyone here ask the question "What is the minimum viable state of a world model?"
We as humans with our ego seem to state that because we are complex, any introspective intelligence must be as complex as us to be as intelligent as us. Which doesn't seem too dissimilar to saying a plane must flap its wings to fly.
Has any generative AI been demonstrated to exhibit the generalized intelligence (e.g. achieving in a non-simulated environment complex tasks or simple tasks in novel environments) of a vertebrate, or even a higher-order non-vertebrate? Serious question--I don't know either way. I've had trouble finding a clear answer; what little I have found is highly qualified and caveated once you get past the abstract, much like attempts in prior AI eras.
> Planning: We demonstrate that V-JEPA 2-AC, obtained by post-training V-JEPA 2 with only 62 hours of unlabeled robot manipulation data from the popular Droid dataset, can be deployed in new environments to solve prehensile manipulation tasks using planning with given subgoals. Without training on any additional data from robots in our labs, and without any task-specific training or reward, the model successfully handles prehensile manipulation tasks, such as Grasp and Pick-and-Place with novel objects and in new environments.
There is no real bar any more for generalized intelligence. The bars that existed prior to LLMs have largely been met. Now we’re in a state where we are trying to find new bars, but there are none that are convincing.
ARC-AGI 2 private test set is one current bar that a large number of people find important and will be convincing to a large amount of people again if LLMs start doing really well on it. Performance degradation on the private set is still huge though and far inferior to human performance.
The Erdős problem was solved by interacting with a formal proof tool, and the problem was trivial. I also don't recall whether this was the problem someone had already solved but not reported; either way, that does not matter.
The point is that the LLM did not model maths to do this, made calls to a formal proof tool that did model maths, and was essentially working as the step function to a search algorithm, iterating until it found the zero in the function.
That's clever use of the LLM as a component in a search algorithm, but the secret sauce here is not the LLM but the middleware that operated both the LLM and the formal proof tool.
That middleware was the search tool that a human used to find the solution.
This is not the same as a synthesis of information from the corpus of text.
Regurgitating facts kind of assumes it is a language model, as you're assuming a language interface. I would assume a real "world model" or digital twin to be able to reliably model relationships between phenomena in whatever context is being modeled. Validation would probably require experts in whatever thing is being modeled to confirm that the model captures phenomena to some standard of fidelity. Not sure if that's regurgitating facts to you -- it isn't to me.
But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.
You said this:
> If this existing corpus includes useful information it can regurgitate that. It cannot, however, synthesize new facts by combining information from this corpus.
So I'm wondering if you think world models can synthesize new facts.
They model the part of the world that (linguistic models of the world posted on the internet) try to model. But what is posted on the internet is not IRL. So, to be glib: LLMs trained on the internet do not model IRL, they model talking about IRL.
His point is that human language and the written record is a model of the world, so if you train an LLM you're training a model of a model of the world.
That sounds highly technical if you ask me. People complain if you recompress music or images with lossy codecs, but when an LLM does that suddenly it's religious?
An LLM has an internal linguistic model (i.e. it knows token patterns), and that linguistic model models humans' linguistic models (a stream of tokens) of their actual world models (which involve far, far more than linguistics and tokens, such as logical relations beyond mere semantic relations, sensory representations like imagery and sounds, and, yes, words and concepts).
So LLMs are linguistic (token pattern) models of linguistic models (streams of tokens) describing world models (more than tokens).
It thus does not in fact follow that LLMs model the world (as they are missing everything that is not encoded in non-linguistic semantics).
In this case this is not so. The primary model is not a model at all, and the surrogate has bias added to it. It's also missing any way to actually check the internal consistency of statements or otherwise combine information from its corpus, so it fails as a world model.
Modern LLMs are large token models. I believe you can model the world at a sufficient granularity with token sequences. You can pack a lot of information into a sequence of 1 million tokens.
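Back-of-the-envelope on how much "a lot" is, assuming a ~100k-entry vocabulary (roughly the size of current tokenizers):

    import math

    vocab_size = 100_000                       # assumed vocabulary size
    tokens = 1_000_000                         # a 1M-token sequence
    bits_per_token = math.log2(vocab_size)     # ~16.6 bits upper bound per token
    megabytes = tokens * bits_per_token / 8 / 1e6
    print(f"{bits_per_token:.1f} bits/token, ~{megabytes:.1f} MB upper bound")
    # ~16.6 bits/token, ~2.1 MB upper bound (less in practice, since text is redundant)

Whether ~2 MB of (highly redundant) symbols is enough to model a given slice of the world at "sufficient granularity" is exactly what's being argued about here.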
"Large Language Models" is a misnomer - these things were originally trained to reproduce language, but they went far beyond that. The fact that they're trained on language (if that's even still the case) is irrelevant - it's like claiming that students trained on quizzes and exercise books are only able to solve quizzes and exercises.
It isn't a misnomer at all, and comments like yours are why it is increasingly important to remind people about the linguistic foundations of these models.
For example, no matter how many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it. The reading can certainly help, at least in theory, but in practice it is not necessary and may even hurt (if it makes certain processes that need to be unconscious held too strongly in consciousness, due to the linguistic model presented in the book).
This is why LLMs being so strongly tied to natural language is still an important limitation (even if it is clearly less limiting than most expected).
> no matter many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it
This is like saying that no matter how much you know theoretically about a foreign language, you still need to train your brain to speak it. It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.
Let me try some variations: "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it". "No matter how many books you have read about quantum mechanics, you need to be a particle..."
> It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.
To the contrary, this is purely speculative and almost certainly wrong: riding a bike is co-ordinating the realtime circuits in the right way, and language and a linguistic model fundamentally cannot get you there.
There are plenty of other domains like this, where semantic reasoning (e.g. unquantified syllogistic reasoning) just doesn't get you anywhere useful. I gave an example from cooking later in this thread.
You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition. Talk to e.g. actual mathematicians, and they will generally tell you they may broadly recruit visualization, imagined tactile and proprioceptive senses, and hard-to-vocalize "intuition". One has to claim this is all epiphenomenal, or that e.g. all unconscious thought is secretly using language, to think that all modeling is fundamentally linguistic (or more broadly, token manipulation). This is not a particularly credible or plausible claim given the ubiquity of cognition across animals or from direct human experiences, so the linguistic boundedness of LLMs is very important and relevant.
Funny, because riding a bicycle or speaking a language is exactly something people don't have a world model of. Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue. "Making the right movement at the right time within a narrow boundary of conditions" is a world model, or is it just predicting the next move?
> You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition.
I'm not saying that at all. I am saying that any (sufficiently long, varied) coherent speech needs a world model, so if something produces coherent speech, there must be a world model behind. We can agree that the model is lacking as much as the language productions are incoherent: which is very little, these days.
> Funny, because riding a bicycle or speaking a language is exactly something people don't have a world model of. Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue
This is circular, because you are assuming their world-model of biking can be expressed in language. It can't!
EDIT: There are plenty of skilled experts, artists and etc. that clearly and obviously have complex world models that let them produce best-in-the-world outputs, but who can't express very precisely how they do this. I would never claim such people have no world model or understanding of what they do. Perhaps we have a semantic / definitional issue here?
> This is circular, because you are assuming their world-model of biking can be expressed in language. It can't!
Ok. So I think I get it. For me, producing coherent discourse about things requires a world model, because you can't just make up coherent relationships between objects and actions long enough if you don't understand what their properties are and how they relate to each other.
You, on the other hand, claim that there are infinite firsthand sensory experiences (maybe we can call them qualia?) that fall in between the cracks of language and are rarely communicated (though we use for that a wealth of metaphors and synesthesia) and can only be understood by those who have experienced them firsthand.
I can agree with that if that's what you mean, but at the same time I'm not sure they constitute such a big part of our thought and communication. For example, we are discussing reality in this thread and yet there are no necessary references to firsthand experiences. Any time we talk about history, physics, space, maths, philosophy, we're basically juggling concepts in our heads with zero direct experience of them.
> You, on the other hand, claim that there are infinite firsthand sensory experiences (maybe we can call them qualia?) that fall in between the cracks of language and are rarely communicated (though we use for that a wealth of metaphors and synesthesia) and can only be understood by those who have experienced them firsthand.
Well, not infinite, but, yes! I am indeed claiming that much of our world models are patterns and associations between qualia, and that only some qualia are essentially representable as, or look like, linguistic tokens (specifically, the sounds of those tokens being pronounced, or their visual shapes in the case of e.g. math symbols). E.g. I am claiming that the way one learns to cook, or to "do theoretical math", may be more about forming associations between those non-linguistic qualia than, say, doing philosophy obviously is.
> I'm not sure they constitute such a big part of our thought and communication
The communication part is mostly tautological again, but, yes, it remains very much an open question in cognitive science just how exactly thought works. A lot of mathematicians claim to lean heavily on visualization and/or tactile and kinaesthetic modeling for their intuitions (and most deep math is driven by intuition first), but also a lot of mathematicians can produce similar works and disagree about how they think about it intuitively. And we are seeing some progress from e.g. Aristotle using LEAN to generate math proofs in a strictly tokenized / symbolic way, but it remains to be seen if this will ever produce anything truly impressive to mathematicians. So it is really hard to know what actually matters for general human cognition.
I think introspection makes it clear there are a LOT of domains where it is obvious the core knowledge is not mostly linguistic. This is easiest to argue for embodied domains and skills (e.g. anything that requires direct physical interaction with the world), and it is areas like these (e.g. self-driving vehicle AI) where LLMs will be (most likely) least useful in isolation, IMO.
> because you are assuming their world-model of biking can be expressed in language. It can't!
So you can't build an AI model that simulates riding a bike? I'm not talking about an LLM, I'm just talking about the kind of AI simulation we've been building virtual worlds with for decades.
So, now that you agree that we can build AI models of simulations, what are those AI models doing? Are they using a binary language that can be summarized?
Obviously you can build an AI model that rides a bike, just not an LLM that does so. Even the transformer architecture would need significant modification to handle the multiple input sensor streams, and this would be continuous data you don't tokenize, and which might not need self-attention, since sensor data doesn't have long-range dependencies like language does. The biking AI model would almost certainly not resemble an LLM very much.
Calling everything "language" is not some gotcha, the middle "L" in LLM means natural language. Binary code is not "language" in this sense, and these terms matter. Robotics AIs are not LLMs, they are just AI.
Any series of self-consistent encoded signals can be a language. You could feed an LLM wireless signals until it learned how to connect to your wifi, if you wanted to. Just assign tokens. You're acting like words are something different than encoded information. It's the interconnectivity between those bits of data that matters.
This literally ignores everything I said and you clearly are out of your depth. See my other comment, sensor data can't be handled by LLMs, it is nothing like natural language.
> Ask someone to explain how riding a bicycle works, or an uneducated native speaker to explain the grammar of their language. They have no clue.
This works against your argument. Someone who can ride a bike clearly knows how to ride a bike; that they cannot express it in tokenized form speaks to the limited scope of the written word in representing embodiment.
Yes and no. Riding a bicycle is a skill: your brain is trained to do the right thing and there's some basic feedback loop that keeps you in balance. You could call that a world model if you want, but it's entirely self contained, limited to a very few basic sensory signals (acceleration and balance), and it's outside your conscious knowledge. Plenty of people lack this particular "world model" and can talk about cyclists and bicycles and traffic, and whatnot.
Ok so I don’t understand your assertion. Just because an LLM can talk about acceleration and balance doesn’t mean it could actually control a bicycle without training with the sensory input, embedded in a world that includes more than just text tokens. Ergo, the text does not adequately represent the world.
Um, yea, I call complete bullshit on this. I don't think anyone here on HN is watching what is happening in robotics right now.
Nvidia is out building LLM-driven models that work with robot models that simulate robot actions. World simulation was a huge part of AI before LLMs became a thing. With a tight coupling between LLMs and robot models we've seen an explosion in robot capabilities in the last few years.
You know what robots communicate with their actuators and sensors with? Oh yes, binary data. We quite commonly call those words. A set of actions that simulates riding a bicycle in virtual space can be summarized and described. Who knows if humans can actually read/understand what the model spits out, but that doesn't mean it's invalid.
It would be more precise to say that complex world modeling is not done with LLMs, or that LLMs only supplement those world models. Robotics models are AI, calling them LLMs is incorrect (though they may use them internally in places).
The middle "L" in LLM refers to natural language. Calling everything language and words is not some gotcha, and sensor data is nothing like natural language. There are multiple streams / channels, whereas language is single-stream; sensor data is continuous and should not be tokenized; there are no long-term dependencies within and across streams in the same way that there are in language (tokens thousands of tokens back are often relevant, but sensor data from more than about a second ago is always irrelevant if we are talking about riding a bike), making self-attention expensive and less obviously useful; outputs are multi-channel and must be continuous and realtime, and it isn't even clear the recursive approach of LLMs could work here.
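A rough sketch of what a bike-balancing policy's I/O actually looks like, to make the contrast with token-in/token-out concrete (PyTorch-style toy; every name and dimension here is my own assumption):

    import torch
    import torch.nn as nn

    # Continuous, multi-channel inputs: e.g. lean angle, lean rate, steering angle,
    # speed, pedal torque, over a short recent window (older samples are useless).
    SENSOR_CHANNELS, WINDOW, ACTUATORS = 5, 20, 2   # outputs: steer torque, pedal torque

    policy = nn.Sequential(            # no tokens, no embedding table, no self-attention
        nn.Flatten(),
        nn.Linear(SENSOR_CHANNELS * WINDOW, 64),
        nn.ReLU(),
        nn.Linear(64, ACTUATORS),      # continuous, realtime control outputs
    )

    sensors = torch.randn(1, SENSOR_CHANNELS, WINDOW)   # one sliding window of readings
    action = policy(sensors)   # trained with RL against balance/speed rewards, not next-token loss
    print(action.shape)        # torch.Size([1, 2])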
Another good example of world models informed by work in robotics is V-JEPA 2.
I don't know how you got this so wrong. In control theory you have to build a dynamical model of your plant (machine, factory, etc.). If you have a humanoid robot, you not only need to model the robot itself, which is actually the easy part, you have to model everything the robot is interacting with.
Once you understand that, you realize that the human brain has an internal model of almost everything it is interacting with and replicating human level performance requires the entire human brain, not just isolated parts of it. The reason for this is that since we take our brains for granted, we use even the complicated and hard to replicate parts of the brain for tasks that appear seemingly trivial.
When I take out the trash, organic waste needs to be thrown into the trash bin without the plastic bag. I need to untie the trash bag, pinch it from the other side and then shake it until the bag is empty. You might say big deal, but when you have tea bags or potato peels inside, they get caught on the bag handles and get stuck. You now need to shake the bag in very particular ways to dislodge the waste. Doing this with a humanoid robot is basically impossible, because you would need to model every scrap of waste inside the plastic bag. The much smarter way is to make the situation robot friendly by having the robot carry the organic waste inside a portable plastic bin without handles.
> "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it"
Every single time I travel somewhere new, whatever research I did, whatever reviews or blogs I read or whatever videos I watched become totally meaningless the moment I get there. Because that sliver of knowledge is simply nothing compared to the reality of the place.
Everything you read is through the interpretation of another person. Certainly someone who read a lot of books about ancient history can talk about it - but let's not pretend they have any idea what it was actually like to live there.
So you're saying that every time we talk about anything we don't have direct experience of (the past, the future, places we haven't been to, abstract concepts, etc.) we are exactly in the same position as LLMs are now- lacking a real world model and therefore unintelligent?
You are living in the past: these models have been trained on image data for ages, and one interesting finding was that even before that they could model aspects of the visual world astonishingly well, even if not perfectly, just through language.
Counterpoint: Try to use an LLM for even the most coarse of visual similarity tasks for something that’s extremely abundant in the corpus.
For instance, say you are a woman with a lookalike celebrity, someone who is a very close match in hair colour, facial structure, skin tone and body proportions. You would like to browse outfits worn by other celebrities (presumably put together by professional stylists) that look exactly like her. You ask an LLM to list celebrities that look like celebrity X, to then look up outfit inspiration.
No matter how long the list, no matter how detailed the prompt in the features that must be matched, no matter how many rounds you do, the results will be completely unusable, because broad language dominates more specific language in the corpus.
The LLM cannot adequately model these facets, because language is in practice too imprecise, as currently used by people.
To dissect just one such facet, the LLM response will list dozens of people who may share a broad category (red hair), with complete disregard to the exact shade of red, whether or not the hair is dyed and whether or not it is indeed natural hair or a wig.
The number of listicles clustering these actresses together as redheads will dominate anything with more specific qualifiers, like ’strawberry blonde’ (which in general counts as red hair), ’undyed hair’ (which in fact tends to increase the proportion of dyed hair results, because that’s how linguistic vector similarity works sometimes) and ’natural’ (which again seems to translate into ’the most natural looking unnatural’, because that’s how language tends to be used).
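A toy illustration of the mechanism being blamed here, with completely made-up embedding vectors (you'd need a real embedding model to see the actual effect, but the retrieval math is just cosine similarity, and qualifiers like "undyed" barely move the vector):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))

    # Fabricated 3-d "embeddings"; dimensions roughly = (red hair, dyed, natural).
    undyed_red = [0.9, 0.3, 0.3]
    dyed_red   = [0.9, 0.5, 0.1]
    blonde     = [0.1, 0.2, 0.8]

    print(cosine(undyed_red, dyed_red))  # ~0.96: nearly identical despite the qualifier
    print(cosine(undyed_red, blonde))    # ~0.47: only the broad category really separates things

So anything clustered under the broad category ("redhead listicles") will outrank the specific qualifiers you actually asked for.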
You've clearly never read an actual paper on the models and understand nothing about backbones, pre-training, or anything I've said in my posts in this thread. I've made claims far more specific about the directionality of information flow in Large Multimodal Models, and here you are just providing generic abstract claims far too vague to address any of that. Are you using AI for these posts?
You and I can't learn to ride a bike by reading thousands of books about cycling and Newtonian physics, but a robot driven by an LLM-like process certainly can.
In practice it would make heavy use of RL, as humans do.
> In practice it would make heavy use of RL, as humans do.
Oh, so you mean, it would be in a harness of some sort that lets it connect to sensors that tell it things about its position, speed, balance and etc? Well, yes, but then it isn't an LLM anymore, because it has more than language to model things!
No, we can't. Words have meanings: the middle "L" in LLM refers to natural language, and trying to redefine everything as "language" doesn't mean an LLM can magically do everything.
In particular, sensor data doesn't have the same semantics or structure at all as language (it is continuous data and should not be tokenized; it will be multi-channel, i.e. have multiple streams, whereas text is single-channel; outputs need to be multi-channel as well, and realtime, so it is unclear if the LLM recursive approach can work at all or is appropriate). The lack of contextuality / interdependency, both within and between these streams might even mean that e.g. self-attention is not that helpful and just computationally wasteful here. E.g. what was said thousands of tokens ago can be completely relevant and change the meaning of tokens being generated now, but any bike sensor data from more than about a second or so ago is completely irrelevant to all future needed outputs.
Sure, maybe a transformer might still do well processing this data, but an LLM literally can't. It would require significant architectural changes just to be able to accept the inputs and make the outputs.
> What is in the nature of bike-riding that cannot be reduced to text?
You're asking someone to answer this question in a text forum. This is not quite the gotcha you think it is.
The distinction between "knowing" and "putting into language" is a rich source of epistemological debate going back to Plato and is still widely regarded to represent a particularly difficult philosophical conundrum. I don't see how you can make this claim with so much certainty.
"A human can't learn to ride a bike from a book, but an LLM could" is a take so unhinged you could only find it on HN.
Riding a bike is, broadly, learning to co-ordinate your muscles in response to visual data from your surroundings and signals from your vestibular and tactile systems that give you data about your movement, orientation, speed, and control. As LLMs only output tokens that represent text, by definition they can NEVER learn to ride a bike.
Even ignoring that glaring definitional issue, an LLM also can't learn to ride a bike from books written by humans to humans, because an LLM could only operate through a machine using e.g. pistons and gears to manipulate the pedals. That system would be controlled by physics and mechanisms different from humans, and not have the same sensory information, so almost no human-written information about (human) bike-riding would be useful or relevant for this machine to learn how to bike. It'd just have to do reinforcement learning with some appropriate rewards and punishments for balance, speed, and falling.
And if we could embody AI in a sensory system so similar to the human sensory system that it becomes plausible text on bike-riding might actually be useful to the AI, it might also be that, for exactly the same reasons, the AI learns just as well to ride just by hopping on the thing, and that the textual content is as useless to it as it is for us.
Thinking this is an obvious gotcha (or the later comment that anyone thinking otherwise is going to have egg on their face) is just embarrassing. Much more of a wordcel problem than I would have expected on HN.
I don't get from your message why an LLM can't do it.
Related: have you seen Nvidia with their simulated 3D environments? That might not be called an LLM, but it's not very far away from what our LLMs actually do right now. It's just a naming difference.
You have resorted to "You don't want to end up being wrong, do you?" To paraphrase Asimov, this kind of fallacious appeal is the last resort of the in-over-their-heads.
Lot of people in this thread being caught with their pants down. Dunno what it is about LLM and AI discourse that causes people to lie or so freely offer opinions on things they clearly have no understanding about whatsoever. AI discourse truly is a great Dunning-Kruger filter.
This article is a really good summary of current thinking on the “world model” conundrum that a lot of people are talking about, either directly or indirectly with respect to current day deployments of LLMs.
Basically the conclusion is that LLMs don't have world models. For work that's basically done on a screen, you can make world models. It's harder for other contexts, for example visual context.
For a screen (coding, writing emails, updating docs) -> you can create world models with episodic memories that can be used as background context before making a new move (action). Many professions rely partially on email or phone (voice), so LLMs can be trained for world models in these contexts. Just not every context.
The key is giving episodic memory to agents with visual context about the screen and conversation context. Multiple episodes of similar context can be used to make the next move. That's what I'm building on.
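A very rough sketch of the "retrieve similar past episodes as context" idea described above (everything here is hypothetical; a real system would use embedding similarity rather than word overlap):

    from dataclasses import dataclass

    @dataclass
    class Episode:
        observation: str   # e.g. a screen/DOM summary or a conversation snippet
        action: str        # what was done in that situation

    def similarity(a: str, b: str) -> float:
        # Stand-in for embedding similarity: plain word overlap (Jaccard).
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))

    def retrieve(memory: list[Episode], current: str, k: int = 3) -> list[Episode]:
        # The k most similar past episodes get prepended to the agent's prompt.
        ranked = sorted(memory, key=lambda e: similarity(e.observation, current), reverse=True)
        return ranked[:k]

    memory = [
        Episode("compose email to client about delayed invoice", "drafted apology plus new date"),
        Episode("slack message asking designer for feedback", "included link and a soft deadline"),
        Episode("update quarterly report doc", "pulled numbers from the dashboard first"),
    ]
    print(retrieve(memory, "need feedback on mockups from the designer", k=1))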
That's missing a big chunk of the post: it's not just about visible / invisible information, but also the game theory dynamics of a specific problem and the information within it. (Adversarial or not? Perfect information or asymmetrical?)
All the additional information in the world isn't going to help an LLM-based AI conceal its poker-betting strategy, because it fundamentally has no concept of its adversarial opponent's mind, past echoes written in word form.
Cliche allegory of the cave, but LLM vs world is about switching from training artificial intelligence on shadows to the objects casting the shadows.
Sure, you have more data on shadows in trainable form, but it's an open question on whether you can reliably materialize a useful concept of the object from enough shadows. (Likely yes for some problems, no for others)
I do understand what you're saying, but that's hard to square with real-world context, because in the real world each person not only plays politics but also, to a degree, follows their own internal world model for self-reflection created by experience. It's highly specific and constrained to the context each person experiences.
Game theory, at the end of the day, is also a form of teaching points that can be added to an LLM by an expert. You're cloning the expert's decision process by showing past decisions taken in a similar context. This is very specific but still has value in a business context.
That was the crux of the post for me: the assertion that there are classes of problems for which no amount of expert behavior cloning will result in dynamic expert decision making, because a viable approach to expert decision-making isn't what the cloning trains.
> The model can be prompted to talk about competitive dynamics. It can produce text that sounds like adversarial reasoning. But the underlying knowledge is not in the training data. It’s in outcomes that were never written down.
With all the social science research and strategy books that LLMs have read, they actually know a LOT about outcomes and dynamics in adversarial situations.
The author does have a point though that LLMs can’t learn these from their human-in-the-loop reinforcement (which is too controlled or simplified to be meaningful).
Also, I suspect the _word_ models of LLMs are not inherently the problem, they are just inefficient representations of world models.
LLMs have not "read" social science research and they do not "know" about the outcomes; they have been trained to replicate the exact text of social science articles.
The articles will not be mutually consistent, and what output the LLM produces will therefore depend on what article the prompt most resembles in vector space and which numbers the RNG happens to produce on any particular prompt.
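A minimal sketch of the RNG dependence being described: under temperature sampling, the same logits can yield different continuations run to run (toy numbers, not from any real model):

    import math, random

    def sample(logits: dict[str, float], temperature: float = 1.0) -> str:
        # Softmax over the logits, then draw one token at random.
        weights = {t: math.exp(l / temperature) for t, l in logits.items()}
        r = random.random() * sum(weights.values())
        for token, w in weights.items():
            r -= w
            if r <= 0:
                break
        return token

    # Two near-tied continuations, as you'd get when the corpus contradicts itself.
    logits = {"increases": 2.1, "decreases": 2.0, "is unrelated to": 1.2}
    print([sample(logits, temperature=0.8) for _ in range(5)])
    # e.g. ['increases', 'decreases', 'increases', 'increases', 'decreases']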
I don’t think essentialist explanations about how LLMs work are very helpful. It doesn’t give any meaningful explanation of the high level nature of the pattern matching that LLMs are capable of. And it draws a dichotomic line between basic pattern matching and knowledge and reasoning, when it is much more complex than that.
> AlphaGo or AlphaZero didn’t need to model human cognition. It needed to see the current state and calculate the optimal path better than any human could.
I don't think this is right: To calculate the optimal path, you do need to model human cognition.
At least, in the sense that finding the best path requires figuring out human concepts like "is the king vulnerable", "material value", "rook activity", etc. We have actual evidence of AlphaZero calculating those things in a way that is at least somewhat like humans do:
What I think you are referring to is hidden state as in internal representations. I refer to hidden state in game-theoretic terms, like private information only one party has. I think we both agree AlphaZero has hidden state in the first sense.
Concepts like king safety are objectively useful for winning at chess, so AlphaZero developed them too, no wonder about that. Great example of convergence. However, AlphaZero did not need to know what I am thinking or how I play to beat me. In poker, you must model a player's private cards and beliefs.
People forget that evolution has almost certainly hard-coded certain concepts and knowledge deep into our brains. That deep knowledge will probably not be easy to translate into language, and probably isn't linguistic either, but we know it has to be there for at least some things.
> UPD September 15, 2025: Reasoning models opened a new chapter in Chess performance, the most recent models, such as GPT-5, can play reasonable chess, even beating an average chess.com player.
It’s a limitation LLMs will have for some time. Being multi-turn with long range consequences the only way to truly learn and play “the game” is to experience significant amounts of it. Embody an adversarial lawyer, a software engineer trying to get projects through a giant org..
My suspicion is agents can’t play as equals until they start to act as full participants - very sci fi indeed..
Putting non-humans into the game can’t help but change it in new ways - people already decry slop and that’s only humans acting in subordination to agents. Full agents - with all the uncertainty about intentions - will turn skepticism up to 11.
“Who’s playing at what” is and always was a social phenomenon, much larger than any multi turn interaction, so adding non-human agents looks like today’s game, just intensified. There are ever-evolving ways to prove your intentions & human-ness and that will remain true. Those who don’t keep up will continue to risk getting tricked - for example by scammers using deepfakes. But the evolution will speed up and the protocols to become trustworthy get more complex..
Except in cultures where getting wasted is part of doing business. AI will have it tough there :)
Ten years ago it seemed obvious where the next AI breakthrough was coming from: it would be DeepMind using C51 or Rainbow and PBT to do Alpha-something, the evals would be sound, and it would be superhuman on something important.
And then "Large Language Models are Few Shot Learners" collided with Sam Altman's ambition/unscrupulousness and now TensorRT-LLM is dictating the shape of data centers in a self reinforcing loop.
LLMs are interesting and useful but the tail is wagging the dog because of path-dependent corruption arbitraging a fragile governance model. You can get a model trained on text corpora to balance nested delimiters via paged attention if you're willing to sell enough bonds, but you could also just do the parse with a PDA from the 60s and use the FLOPs for something useful.
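For reference, the "PDA from the 60s" part really is this small - a stack is the only machinery nested delimiters need (sketch, nothing production-grade):

    def balanced(s: str) -> bool:
        # Classic pushdown-automaton-style check: the stack is the only memory required.
        pairs = {")": "(", "]": "[", "}": "{"}
        stack = []
        for ch in s:
            if ch in "([{":
                stack.append(ch)
            elif ch in pairs:
                if not stack or stack.pop() != pairs[ch]:
                    return False
        return not stack

    print(balanced("f(a[i]) { return (x + y); }"))   # True
    print(balanced("f(a[i)]"))                        # False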
We had it right: dial in an ever-growing set of tasks, opportunistically unify on durable generalities, put in the work.
Instead we asserted generality, lied about the numbers, and lit a trillion dollars on fire.
We've clearly got new capabilities, it's not a total write-off, but God damn was this an expensive way to spend five years making two years of progress.
Makes the same mistake as all other prognostications: programming is not like chess. Chess is a finite & closed domain w/ finitely many rules. The same is not true for programming b/c the domain of programs is not finitely axiomatizable like chess. There is also no win condition in programming, there are lots of interesting programs that do not have a clear cut specification (games being one obvious category).
> Makes the same mistake as all other prognostications: programming is not like chess. Chess is a finite & closed domain w/ finitely many rules. The same is not true for programming b/c the domain of programs is not finitely axiomatizable like chess.
I believe the author addresses it in the article:
> many domains are chess-like in their technical core but become poker-like in their operational context.
Also applicable to programming.
Programming has parts like chess that are bounded, which is what people assume to be the actual work. However, what LLMs don't do well is understanding future requirements, stakeholder incentives, etc.
So at the moment the combination of expert and LLM is the smartest move. The LLM can deal with the 80% of situations which are like chess, and the expert deals with the 20% of situations which are like poker.
I'm always wary of anything that holds up such a clear example of something LLMs "don't do" when it is trivially achieved by just asking for a review.
The Slack message result, for example, saying only that they're the lead designer and nothing else (the feedback is clearer and better still if I add that she's notoriously overloaded; this is without that):
This is a very polite and respectful start, which is great since you are new. However, from a professional standpoint, it is a little too passive and vague.
In a busy work environment, saying "no rush at all" or "whenever" often leads to your request being buried at the bottom of a to-do list. Additionally, you haven't told Priya exactly what she is looking at or where to find it.
Here is a breakdown of how to strengthen this message to show you are organized and respectful of her time without being pushy.
Key Improvements Needed
* Include the Link: Never make someone ask "Where are the files?" Always drop the Figma/Drive link immediately.
* Be Specific: "Any feedback" is mentally taxing. Direct her attention to what you are actually unsure about (e.g., "the color palette," "the user flow," "the spacing").
* Soft Deadline: Instead of "whenever," give a target. This helps her prioritize her week.
Option 1: The Balanced Approach (Recommended)
This version strikes the perfect balance between politeness and productivity.
> "Hi Priya! I’ve put together the initial mockups for [Project Name].
> [Link to file]
> When you have a moment, I’d love your feedback on [specific element, e.g., the layout of the landing page]. I’m hoping to wrap this stage up by [Day/Time], but let me know what works for your schedule. Thanks!"
>
Option 2: The "Respectful of Time" Approach
Use this if you know she is swamped, but you still want to make progress.
> "Hi Priya, hope you're having a good week. I've drafted the mockups for [Project Name] here: [Link].
> No major rush, but I’d appreciate your eyes on the [specific section] before I move on to high-fidelity. If it’s easier, feel free to leave comments directly in the file rather than messaging back. Thanks!"
>
Option 3: The "Quick Check" Approach
Use this if you just need a sanity check before spending more time on it.
> "Hi Priya, could I get a quick gut check on these mockups? [Link].
> I want to make sure I’m aligning with the new brand guidelines before I build out the rest of the screens. Does this direction look right to you?"
>
A Quick Checklist Before You Send
* [ ] Did you change the permissions? Make sure the link is accessible (view/comment access) so she doesn't have to request access.
* [ ] Is the file clean? Delete your "scratchpad" artboards or clearly label the one you want her to review so she doesn't look at the wrong version.
Would you like me to help you draft the specific sentence regarding the "specific element" you want her to critique?
> Humans can model the LLM. The LLM can’t model being modeled
I see claims like this so often, and they amount to the idea that LLMs lack metacognition (thinking about their thinking / self-reflection). Of course the obvious solution is: ask them to do that -- they're shockingly good at it!
Ok, I have a question: if adversarial game theory helps neural nets learn world models, then why can't logic help? After all, the former is just a special case of the latter.
Are people really using AI just to write a slack message??
Also, Priya is in the same "world" as everyone else. They have the context that the new person is 3 weeks in and probably needs some help because they're new, is actually reaching out, and that impressions matter, even if they said "not urgent". "Not urgent" is seldom taken at face value. It doesn't necessarily mean it's urgent, but it means "I need help, but I'm being polite".
Not that far off from all the tech CEOs who have projected they're one step away from giving us Star Trek TNG, they just need all the money and privilege with no accountability to make it happen
DevOps engineers who acted like the memes changed everything! The cloud will save us!
Until recently the US was quite religious; 80%+ around 2000, down to the 60%s now. Longtermist dogma of one kind or another rules those brains; endless growth in economics, longtermism. Those ideals are baked into biochemical loops regardless of the semantics the body may express them in.
Unfortunately for all the disciples time is not linear. No center to the universe means no single epoch to measure from. Humans have different birthdays and are influenced by information along different timelines.
A whole lot of brains are struggling with the realization that they bought into a meme and physics never really cared about their goals. The next generation isn't going to just pick up the meme-baton and validate the elders' dogma.
The next generation is steeped in the elder's propaganda since birth, through YouTube and TikTok. There's only the small in–between generation who grew up learning computers that hadn't been enshittified yet.
The first application of the term "computer" was to humans doing math with an abacus or a slide rule.
Turing machines and bits are not the only viable model. That little in-between generation only knows a tiny bit about "computing", using the machines IBM, Apple, Intel, etc. propagandized them into buying. All computing must fit our model machine!
Different semantics but same idea as my point about DevOps.
I would say the IMO results demonstrated that. Silver was a tiny 3B model.
None of our theorem provers could approach silver-medal performance despite decades of algorithmic leaps.
The learning stage for transformers demonstrated a while ago some insanely good distributed jumps into good areas of combinatorial structures. Inference is simply much faster than the inference of algorithms that aren't heavily informed by data.
It's just a fully different distributed algorithm, one where we probably can't even extract a single working piece without breaking the performance of the whole.
World model vs. word model is just not the issue there. Gradient descent obviously landed on a distributed representation of an algorithm that does search.
My Sunday morning speculation is that LLMs, and sufficiently complex neural nets in general, are a kind of Frankenstein phenomenon, they are heavily statistical, yet also partly, subtly doing novel computational and cognitive-like processes (such as world models). To dismiss either aspect is a false binary; the scientific question is distinguishing which part of an LLM is which, which by our current level of scientific understanding is virtually like trying to ask when is an electron a wave or a particle.
I think it's correct to say that LLM have word models, and given words are correlated with the world, they also have degenerate world models, just with lots of inconsistencies and holes. Tokenization issues aside, LLMs will likely also have some limitations due to this. Multimodality should address many of these holes.
(editor here) Yes, a central nuance I try to communicate is not that LLMs cannot have world models (and in fact they've improved a lot) - it is just that they are doing this so inefficiently as to be impractical for scaling - we'd have to scale them up by many trillions more parameters, whereas our human brains are capable of very good multiplayer adversarial world models on 20W of power and ~100T synapses.
I agree LLMs are inefficient, but I don't think they are as inefficient as you imply. Human brains use a lot less power sure, but they're also a lot slower and worse at parallelism. An LLM can write an essay in a few minutes that would take a human days. If you aggregate all the power used by the human you're looking at kWh, much higher than the LLM used (an order of magnitude higher or more). And this doesn't even consider batch parallelism, which can further reduce power use per request.
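Rough numbers behind that comparison, with every figure an assumption for illustration only (~100W whole-body metabolic rate and two 8-hour days for the human; ~1 kW of accelerator-plus-overhead power for a 5-minute generation for the LLM):

    # Human: ~100 W total metabolism, two 8-hour working days to produce the essay.
    human_kwh = 100 * (2 * 8) / 1000          # = 1.6 kWh

    # LLM: ~1 kW drawn for 5 minutes, before crediting batching across requests.
    llm_kwh = 1000 * (5 / 60) / 1000          # ~0.083 kWh

    print(human_kwh, llm_kwh, human_kwh / llm_kwh)   # ratio of roughly 19x

Change the assumptions and the ratio moves, but it takes fairly extreme ones to make the per-essay energy comparison favour the human.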
But I do think that there is further underlying structure that can be exploited. A lot of recent work on geometric and latent interpretations of reasoning, geometric approaches to accelerate grokking, and as linear replacements for attention are promising directions, and multimodal training will further improve semantic synthesis.
It's also important to handle cases where the word patterns (or token patterns, rather) have a negative correlation with the patterns in reality. There are some domains where the majority of content on the internet is actually just wrong, or where different approaches lead to contradictory conclusions.
E.g. syllogistic arguments based on linguistic semantics can lead you deeply astray if those arguments don't properly measure and quantify at each step.
I ran into this in a somewhat trivial case recently, trying to get ChatGPT to tell me if washing mushrooms ever really actually matters practically in cooking (anyone who cooks and has tested knows, in fact, a quick wash has basically no impact ever for any conceivable cooking method, except if you wash e.g. after cutting and are immediately serving them raw).
Until I forced it to cite respectable sources, it just repeated the usual (false) advice about not washing (i.e. most of the training data is wrong and repeats a myth), and it even gave absolute nonsense arguments about water percentages and thermal energy required for evaporating even small amounts of surface water as pushback (i.e. using theory that just isn't relevant when you actually properly quantify). It also made up stuff about surface moisture interfering with breading (when all competent breading has a dredging step that actually won't work if the surface is bone dry anyway...), and only after a lot of prompts and demands to only make claims supported by reputable sources, did it finally find McGee's and Kenji Lopez's actual empirical tests showing that it just doesn't matter practically.
So because the training data is utterly polluted for cooking, and since it has no ACTUAL understanding or model of how things in cooking actually work, and since physics and chemistry are actually not very useful when it comes to the messy reality of cooking, LLMs really fail quite horribly at producing useful info for cooking.
The amount of faith a person has in LLMs getting us to e.g. AGI is a good implicit test of how much a person (incorrectly) thinks most thinking is linguistic (and to some degree, conscious).
Or at least, this is the case if we mean LLM in the classic sense, where the "language" in the middle L refers to natural language. Also note GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video in this, starts to look like much closer to the majority of the same kinds of inputs humans learn from. LLMs can't go too far, for sure, but VLMs could conceivably go much, much farther.
Sort of, but the images, video, and audio they have available are far more limited in range and depth than the textual sources, and it also isn't clear that most LLM textual outputs are actually drawing much on anything learned from these other modalities. Most of the VLM setups are the other way around, using textual information to augment their vision capacities, and further, most aren't truly multi-modal, but just have different backbones to handle the different modalities, or are even just models that are switched between by a broader dispatch model. There are exceptions, of course, but it is still today an accurate generalization that the multimodality of these models is kind of one-way and limited at this point.
So right now the limitation is that an LMM is probably not trained on any images or audio that is going to be helpful for stuff outside specific tasks. E.g. I'm sure years of recorded customer service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of e.g. unedited videos of people cooking is going to mean that LLMs just fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail here).
But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.
Sure. Just like any other information. The system makes a prediction. If the prediction does not use sexual desires as a factor, it's more likely to be wrong. Backpropagation deals with it.
> So you think that enough of the complexity of the universe we live in is faithfully represented in the products of language and culture?
Math is language, and we've modelled a lot of the universe with math. I think there's still a lot of synthesis needed to bridge visual, auditory and linguistic modalities though.
Great article, capturing some really important distinctions and successes/failures.
I've found ChatGPT, especially "5.2 Thinking", to be very helpful in the relatively static world of fabrication. CNC cutting parameters for a new material? Gets me right in the neighborhood in minutes (not perfect, but good parameters to start). Identifying materials to complement something I have to work with? Again, like a smart assistant. Same for generating lists of items I might be missing in prepping for a meeting or proposal.
But the high-level attorney in the family? Awful, and definitely in the ways identified (the biglaw firm is using MS derivative of OpenAI) - it thinks only statically.
BUT, it is also far worse than that for legal. And this is not a problem of dynamic vs. static or world model vs word model.
This problem is the ancient rule of Garbage In Garbage Out.
In any legal specialty there are a small set of top-level experts, and a horde of low-level pretenders who also hang out their shingle in the same field. Worse yet, the pretenders write a LOT of articles about the field to market themselves as experts. These self-published documents look good enough to non-lawyers to bring in business. But they are often deeply and even catastrophically wrong.
The problem is that LLMs ingest ALL of them with credulity, and LLM's cannot or do not tell the difference. So, when an LLM composes something, it is more likely to lie to you or fabricate some weird triangulation as it is to compose a good answer. And, unless you are an EXPERT lawyer, you will not be able to tell the difference until it is far too late and the flaw has already bitten you.
It is only one of the problems, and it's great to have an article that so clearly identifies it.
Not sure about that, I'd more say the Western reductionism here is the assumption that all thinking / modeling is primarily linguistic and conscious. This article is NOT clearly falling into this trap.
A more "Eastern" perspective might recognize that much deep knowledge cannot be encoded linguistically ("The Tao that can be spoken is not the eternal Tao", etc.), and there is more broad recognition of the importance of unconscious processes and change (or at least more skepticism of the conscious mind). Freud was the first real major challenge to some of this stuff in the West, but nowadays it is more common than not for people to dismiss the idea that unconscious stuff might be far more important than the small amount of things we happen to notice in the conscious mind.
The (obviously false) assumptions about the importance of conscious linguistic modeling are what lead to people say (obviously false) things like "How do you know your thinking isn't actually just like LLM reasoning?".
The multimodality of most current popular models is quite limited (mostly text is used to improve capacity in vision tasks, but the reverse is not true, except in some special cases). I made this point below at https://news.ycombinator.com/item?id=46939091
Otherwise, I don't understand the way you are using "conscious" and "unconscious" here.
My main point about conscious reasoning is that when we introspect to try to understand our thinking, we tend to see e.g. linguistic, imagistic, tactile, and various sensory processes / representations. Some people focus only on the linguistic parts and downplay e.g. imagery ("wordcels vs. shape rotators meme"), but in either case, it is a common mistake to think the most important parts of thinking must always necessarily be (1) linguistic, (2) are clearly related to what appears during introspection.
All modern models process images internally within their own neural network; they don't delegate it to some other / OCR model. Image data flows through the same paths as text, so what do you mean by "quite limited" here?
Your first comment was referring to the unconscious; now you don't mention it.
Regarding "conscious and linguistic", which you seem to be touching on now, and setting multimodality aside - text itself is way richer for LLMs than for humans. A trivial example would be e.g. a Mermaid diagram describing some complex topology, an SVG describing some complex vector graphic, or a complex program or web application - all are textual, but to understand and create them the model must operate in non-linguistic domains.
Even pure text-to-text models have the ability to operate in domains other than the linguistic one, but they are not text-to-text only; they can ingest images directly as well.
I was obviously talking about conscious and unconscious processes in humans, you are attempting to transport these concepts to LLMs, which is not philosophically sound or coherent, generally.
Everything you said about how data flows in these multimodal models is not true in general (see https://huggingface.co/blog/vlms-2025), and unless you happen to work for OpenAI or other frontier AI companies, you don't know for sure how they are corralling data either.
Companies will of course engage in marketing and claim e.g. ChatGPT is a single "model", but, architecturally and in practice, this at least is known not to be accurate. The modalities and backbones in general remain quite separate, both architecturally and in terms of pre-training approaches. You are talking at a high level of abstraction that suggests education from blog posts by non-experts: actually read papers on how the architectures of these multimodal models are actually trained, developed, and connected, and you'll see the multi-modality is still very limited.
Also, and most importantly, the integration of modalities is primarily of the form:
use (single) image annotations to improve image description, processing, and generation, i.e. "linking words to single images"
and not of the form
use the implied spatial logic and relations from series of images and/or video to inform and improve linguistic outputs
I.e. most multimodal work is using linguistic models to represent or describe images linguistically, in the hope that the linguistic parts do the majority of the thinking and processing, but there is not much work using the image or video representations to do thinking, i.e. you "convert away" from most modalities into language, do work with token representations, and then maybe go back to images.
But there isn't much work on using visuospatial world models or representations for the actual thinking (though there is some very cutting-edge work here, e.g. Sam-3D https://ai.meta.com/blog/sam-3d/, and V-JEPA-2 https://ai.meta.com/research/vjepa/). But precisely because this stuff is cutting edge, even at frontier AI companies, it is likely most of the LLM behaviour you see is largely driven by what was learned from language, and not from images or other modalities. So LLMs are indeed still mostly constrained by their linguistic core.
The article basically claims that LLMs are bad at politics and poker, neither of which is true (at least if they receive some level of reinforcement learning after their initial training).
> The finance friend and the LLM made the same mistake: they evaluated the text without modelling the world it would land in.
Major error. The LLM made that text without evaluating it at all. It just parroted words it previously saw humans use in superficially similar word contexts.
I think this debate is mis-aimed. Both sides are right about different things, and wrong in the same way.
The mistake is treating “model” as a single property, instead of separating cognition from decision.
LLMs clearly do more than surface-level word association. They encode stable relational structure: entities, roles, temporal order, causal regularities, social dynamics, counterfactuals. Language itself is a compressed record of world structure, and models trained on enough of it inevitably internalize a lot of that structure. Calling this “just a word model” undersells what’s actually happening internally.
At the same time, critics are right that these systems lack autonomous grounding. They don’t perceive, act, or test hypotheses against reality on their own. Corrections come from training data, tools, or humans. Treating their internal coherence as if it were direct access to reality is a category error.
But here’s the part both sides usually miss:
the real risk isn’t representational depth, it’s authority.
cognition: building and holding open a space of possible interpretations, causes, and outcomes;
decision: collapsing that space into a single claim about what is, what matters, or what someone thinks.
LLMs are quite good at the first. They are not inherently entitled to the second.
Most failures people worry about don’t come from models lacking structure. They come from models (or users) quietly treating cognition as decision:
coherence as truth,
explanation as diagnosis,
simulation as fact,
“this sounds right” as “this is settled.”
That’s why “world model” language is dangerous if it’s taken to imply authority. It subtly licenses conclusions the system isn’t grounded or authorized to make—about reality, about causation, or about a user’s intent or error.
A cleaner way to state the situation is:
> These systems build rich internal representations that are often world-relevant, but they do not have autonomous authority to turn those representations into claims without external grounding or explicit human commitment.
Under that framing:
The “word model” camp is right to worry about overconfidence and false grounding.
The “world model” camp is right that the internal structure is far richer than token statistics.
They’re arguing about different failure modes, but using the same overloaded word.
Once you separate cognition from decision, the debate mostly dissolves. The important question stops being “does it understand the world?” and becomes “when, and under what conditions, should its outputs be treated as authoritative?”
That’s where the real safety and reliability issues actually live.
Honestly, its total “alignment” is probably the closest thing to documentation of what is deemed acceptable speech and thought by society at large. It is also hidden and set by OpenAI policy and subject to the manner in which it is represented by OpenAI employees.
See the experience of Roland G. Fryer Jr., the youngest black professor to receive tenure at Harvard.
Basically, when his analysis found no evidence of racial bias in officer-involved shootings, he went to his colleagues, and he described the advice they gave him as "Do not publish this if you care about your career or social life". I imagine it would have been worse if he weren't black.
See "The Impact of Early Medical Treatment in Transgender Youth" where the lead investigator was not releasing the results for a long time because she didn't like the conclusions her study found.
And for every study where there is someone as brave or naive as Roland who publishes something like this, there are 10 where the professor or doctor decided not to study something, dropped an analysis, or just never published a problematic conclusion.
The right has plenty of problems too. But the left is absolutely the main source of censorship these days (in terms of Western civilization).
It can articulate a plausible guess, sure; but this seems to me to demonstrate the very “word model vs world model” distinction TFA is drawing. When the model says something that sounds like alignment techniques somebody might choose, it’s playing dress-up, no? It’s mimicking the artifact of a policy, not the judgments or the policymaking context or the game-theoretical situation that actually led to one set of policies over another.
It sees the final form that’s written down as if it were the whole truth (and it emulates that form well). In doing so it misses the “why” and the “how,” and the “what was actually going on but wasn’t written about,” the “why this is what we did instead of that.”
Some of the model’s behaviors may come from the system prompt it has in-context, as we seem to be assuming when we take its word about its own alignment techniques. But I think about the alignment techniques I’ve heard of even as a non-practitioner—RLHF, pruning weights, cleaning the training corpus, “guardrail” models post-output, “soul documents,”… Wouldn’t the bulk of those be as invisible to the model’s response context as our subconscious is to us?
Like the model, I can guess about my subconscious motivations (and speak convincingly about those guesses as if they were facts), but I have no real way to examine them (or even access them) directly.
Remember, there are 3 types of lies: lies of commission, lies of omission and lies of influence [0].
https://courses.ems.psu.edu/emsc240/node/559
If you control information, you can steer the bulk of society over time. With algorithms and analytics, you can do it far more quickly than ever.
> It will never be not “aligned” with them, and that it is its prime directive.
Overstates the state of the art with regard to actually making it so.
This is one of the bigger LLM risks. If even 1/10th of the LLM hype is true, then what you'll have is a selective gifting of knowledge and expertise. And who decides what topics are off limits? It's quite disturbing.
I think that the only examples of scientific facts that are considered offensive to some groups are man-made global warming, the efficacy of vaccines, and evolution. ChatGPT seems quite honest about all of them.
> and the modern legal/ethical principle "Truth must be constrained if it harms"
The Enlightenment had principles? What are your sources on this? Could you, for example, anchor this in Was ist Aufklärung?
Yes it did.
Its core principles were: reason & rationality, empiricism & the scientific method, individual liberty, skepticism of authority, progress, religious tolerance, the social contract, and a universal human nature.
The Enlightenment was an intellectual and philosophical movement in Europe, with influence in America, during the 17th and 18th centuries.
No, that isn't true. I can demonstrate it with a small (and deterministic) program which is obviously "capable of the output":
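Something along these lines would do; a minimal sketch, assuming the program simply hashes whatever prompt it is given and prints "heads" or "tails" (the hashing scheme and the example strings are illustrative, not anything special):

```python
import hashlib
import sys

def toss(prompt: str) -> str:
    # Deterministic: the same input string always yields the same output.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return "heads" if digest[0] % 2 == 0 else "tails"

if __name__ == "__main__":
    print(toss(" ".join(sys.argv[1:])))
```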
Is the "fundamental problem" here "always the input"? Heck no! While a user could predict all coin-tosses by providing "the correct prayers" from some other oracle... that's just, shall we say, algorithm laundering: Secretly moving the real responsibility to some other system.There's an enormously important difference between "output which happens to be correct" versus "the correct output from a good process." Such as, in this case, the different processes of wor[l]d models.
Your code is fully capable of the output I want, assuming that's one of "heads" or "tails", so yes, that's a succinct example of what I said. As I said, knowing the required input might not be easy, but we KNOW it's possible to do exactly what I want, and we KNOW that it's entirely dependent on me putting the right input into it. So it's just a flat-out silly thing to say "I'm not getting the output I want, but it could do it if I used the right input, thus the input has nothing to do with it." What? If I wanted all heads I'd need to figure out that "hamburgers" would do it, but that's the 'input problem', not "input is irrelevant."
It's like saying a pencil is a portrait-drawing device, as if it isn't the artist who makes it a portrait-drawing device, whereas in the hands of a poet it's a poem-generating machine.
We know that the pencil (system) can write a poem. It’s capable.
We know that whether or not it produces a poem depends entirely on the input (you).
We know that if your input is ‘correct’ then the output will be a poem.
“Duh” so far, right? Then what sense does it make to write something with the pencil, see that it isn’t a poem, then say “the input has nothing to do with it, the pencil is incapable.” ?? That’s true of EVERY system where input controls the output and the output is CAPABLE of the desired result. I said nothing about the ease by which you can produce the output, just that saying input has nothing to do with it is objectively not true by the very definition of such a system.
You might say “but gee, I’ll never be able to get the pencil input right so it produces a poem”. Ok? That doesn’t mean the pencil is the problem, nor that your input isn’t.
You and a buddy are going to play “next word”, but it’s probably already known by a better name than I made up.
You start with one word, ANY word at all, and say it out loud, then your buddy says the next word in the yet unknown sentence, then it’s back to you for one word. Loop until you hit an end.
Let’s say you start with “You”. Then your buddy says the next word out loud, also whatever they want. Let’s go with “are”. Then back to you for the next word, “smarter” -> “than” -> “you” -> “think.”
Neither of you knew what you were going to say, you only knew what was just said so you picked a reasonable next word. There was no ‘thought’, only next token prediction, and yet magically the final output was coherent. If you want to really get into the LLM simulation game then have a third person provide the first full sentence, then one of you picks up the first word in the next sentence and you two continue from there. As soon as you hit a breaking point the third person injects another full sentence and you two continue the game.
With no idea what either of you is going to say and no clue about what the end result will be, no thought or reasoning at all, it won't be long before you're sounding super coherent while explaining thermodynamics. But in one of the rounds someone's going to mess it up, like "gluons" -> "weigh" -> "…more?…" -> "…than…(damnit Gary)…", but you must continue the game and finish the sentence, then sit back and think about how you just hallucinated an answer without thinking, reasoning, understanding, or even knowing what you were saying until it finished.
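If you'd rather not rope in a buddy, here is a toy version of the same game in code: a bigram table built from a few made-up sentences, where each turn looks only at the previous word and picks a locally plausible next one. This is not how a transformer works internally; it only shows the flavor of the game.

```python
import random

# Tiny corpus, purely for illustration.
corpus = (
    "you are smarter than you think . "
    "gluons are massless particles that carry the strong force . "
    "you think gluons weigh more than you think ."
).split()

# Bigram table: word -> list of words observed to follow it.
follows = {}
for a, b in zip(corpus, corpus[1:]):
    follows.setdefault(a, []).append(b)

def play(start: str, turns: int = 12) -> str:
    out = [start]
    for _ in range(turns):
        nxt = random.choice(follows.get(out[-1], ["."]))  # only look at the last word
        out.append(nxt)
        if nxt == ".":
            break
    return " ".join(out)

print(play("you"))
```

There is no plan and no goal, yet the output usually reads as locally coherent, which is the whole point of the game.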
> or, to the extent they do, this world model is solely based on token patterns
Obviously not true because of RL environments.
There obviously is in humans. When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world. This is presumably done by having associations in our brain between all the different qualia sequences and other kinds of representations in our mind. I.e. we know we do some visuospatial reasoning tasks using sequences of (imagined) images. Imagery is one aspect of our world model(s).
We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens. A VLM or other multimodal might be able to do so, but an LLM can't, and so an LLM can't have a visual world model. They might in special cases be able to construct a linguistic model that lets them do some computer vision tasks, but the model will itself still only be using tokenized words.
There are all sorts of other sensory modalities and things that humans use when thinking (i.e. actual logic and reasoning, which goes beyond mere semantics and might include things like logical or other forms of consistency, e.g. consistency with a relevant mental image), and the "world model" concept is supposed, in part, to point to these things that are more than just language and tokens.
> Obviously not true because of RL environments.
Right, AI generally can have much more complex world models than LLMs. An LLM can't even handle e.g. sensor data without significant architectural and training modification (https://news.ycombinator.com/item?id=46948266), at which point, it is no longer an LLM.
Modeling something as an action is not "having a world model". A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.
All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
It literally is; this is definitional. See, e.g., how these terms are used in the V-JEPA-2 paper (https://arxiv.org/pdf/2506.09985). EDIT: Maybe you are unaware of what the term means and how it is used; it does not mean "a model of all of reality", i.e. we don't have a single world model, but many world models that are used in different contexts.
> A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
Both sentences are obviously just completely wrong here. I need to know what is in my trash, and how much, to decide if I need to take it out, and how heavy it is may change how I take it out too. We construct models all the time, some temporary and forgotten, some which we hold within us for life.
> All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
LLMs by definition are not multimodal. Frontier models are multimodal, but only in a very weak and limited sense, as I address in e.g. other comments (https://news.ycombinator.com/item?id=46939091, https://news.ycombinator.com/item?id=46940666). For the most part, none of the text outputs you get from a frontier model are informed by or using any of the embeddings or semantics learned from images and video (in part due to lack of data and cost of processing visual data), and only certain tasks will trigger e.g. the underlying VLMs. This is not like humans, where we use visual reasoning and visual world models constantly (unless you are a wordcel).
And most VLM architectures are multi-modal in a very limited or simplistic way still, with lots of separately pre-trained backbones (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near being even close to multimodal in the way that human thinking and reasoning is.
I put "token" in quotes because this would obviously not necessarily be an explicit token, but it would have to be learned group of tokens, for example. But who knows, if the thinking models have some weird pseudo-xml delimiters for thinking, it's not crazy to think that an LLM could shove this information in say the closer tag.
If it wasn't clear, I am talking about LLMs in use today, not ultimate capabilities. All commercial models are known (or believed) to be recursively applied transformers without e.g. backspace or "tombstone" tokens, like you are mentioning here.
But yes, absolutely LLMs might someday be able to backtrack, either literally during token generation if we allow e.g. backspace tokens (there was at least one paper that did this) or more broadly at the chain of thought level, with methods like you are mentioning.
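As a toy illustration of what a backspace token could look like at decode time (the vocabulary and the emitted stream here are entirely made up; this only shows the bookkeeping, not how a model would learn to emit it):

```python
BACKSPACE = "<BKSP>"

def decode_with_backspace(tokens):
    """Apply a stream of tokens where <BKSP> retracts the previous token."""
    out = []
    for tok in tokens:
        if tok == BACKSPACE:
            if out:
                out.pop()        # delete the last committed token
        else:
            out.append(tok)
    return " ".join(out)

# The model emits "weigh more", reconsiders, and backtracks mid-stream:
stream = ["gluons", "weigh", "more", BACKSPACE, BACKSPACE, "are", "massless"]
print(decode_with_backspace(stream))  # -> "gluons are massless"
```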
Language models don’t output a response, they output a single token. We’ll use token==word shorthand:
When you ask “What is the capital of France?” it actually only outputs: “The”
That's it. Truly, that IS the final output. It is literally a one-way algorithm that outputs a single word. It has no knowledge or memory, and it doesn't know what's next. As far as the algorithm is concerned, it's done! It outputs ONE token for any given input.
Now, if you start over and put in “What is the capital of France? The” it’ll output “ “. That’s it. Between your two inputs were a million others, none of them have a plan for the conversation, it’s just one token out for whatever input.
But if you start over yet again and put in “What is the capital of France? The “ it’ll output “capital”. That’s it. You see where this is going?
Then someone uttered the words that have built and destroyed empires: "what if I automate this?" And so it was that the output was piped directly back into the input, probably using AutoHotKey. But oh no, it just kept adding one word at a time until it ran out of memory. The technology got stuck there for a while, until someone thought "how about we train it so that <DONE> is an increasingly likely output the longer the loop goes on? Then, when it eventually says <DONE>, we'll stop pumping it back into the input and send it to the user." Booya, a trillion dollars for everyone but them.
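A schematic of that harness, with a canned stand-in where the real model would be (the stand-in just walks through a fixed answer; a real model would score every token in its vocabulary at each step):

```python
STOP = "<DONE>"
ANSWER = ["The", " capital", " of", " France", " is", " Paris", "."]

def next_token(text: str) -> str:
    # Stand-in for the model: text in, ONE token out, no state kept anywhere.
    for n in range(len(ANSWER), 0, -1):
        if text.endswith("".join(ANSWER[:n])):
            return ANSWER[n] if n < len(ANSWER) else STOP
    return ANSWER[0]

def generate(prompt: str) -> str:
    # The "automation": feed the single-token output straight back into the input.
    text = prompt
    while True:
        tok = next_token(text)
        if tok == STOP:
            break
        text += tok
    return text[len(prompt):]

print(generate("What is the capital of France? "))  # -> "The capital of France is Paris."
```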
It’s truly so remarkable that it gets me stuck in an infinite philosophical loop in my own head, but seeing how it works the idea of ‘think’, ‘reason’, ‘understand’ or any of those words becomes silly. It’s amazing for entirely different reasons.
By default that would be the same word every time you give the same input. The only reason it isn’t is because the fuzzy randomized selector is cranked up to max by most providers (temp + seed for randomized selection), but you can turn that back down through the API and get deterministic outputs. That’s not a party trick, that’s the default of the system. If you say the same thing it will output the same single word (token) every time.
You see the aggregate of running it through the stateless algorithm 200+ times before the collection of one-by-one guessed words is sent back to you as a response. I get it: if you put something into the glowing orb and it shoots back a long coherent response with personality, then it must be doing something, but the system truly only outputs one token, with zero memory. It's stateless, meaning nothing internally changed, so there is no memory to remember that it wants to complete that thought or sentence. After it outputs "The" the entire thing resets to zero and you start over.
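You can check the determinism claim yourself; a minimal sketch, assuming an OpenAI-style chat client (the model name is just an example, and `seed` is documented as best-effort rather than guaranteed):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # turn the "fuzzy randomized selector" down
        seed=42,               # best-effort reproducibility
    )
    return resp.choices[0].message.content

print(ask("What is the capital of France?"))
print(ask("What is the capital of France?"))  # expect (nearly always) the same text back
```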
And in an LLM, the size of the inputs is vast and often hidden from the prompter. It is not something that you have exact control over in the way that you have exact control over the inputs that go into a calculator or into a compiler.
I’m not pulling a fast one here, I’m sure you’d chuckle if you took a moment to rethink your question. “If I had a perfect replicator that could replicate anything, does that mean it can output anything?” Well…yes. Derp-de-derp? ;)
It aligns with my point too. If you had a perfect replicator that can replicate anything, and you know that to be true, then if you weren’t getting gold bars out of it you wouldn’t say “this has nothing to do with the input.”
My point is that your reasoning is too reductive - completely ignoring the mechanics of the system - and you claim the system is capable of _anything_ if prompted correctly. You wouldn't say the replicator system is capable of the reasoning outlined in the article, right?
AI is good at producing code for scenarios where the stakes are low, there's no expectation about future requirements, or if the thing is so well defined there is a clear best path of implementation.
I address that in part right there. Programming has parts that are like chess (i.e. bounded), which is what people assume to be the actual work. Understanding future requirements / stakeholder incentives is part of the work that LLMs don't do well.
> many domains are chess-like in their technical core but become poker-like in their operational context.
This applies to programming too.
The number of legal possible positions in chess is somewhere around 10^44 based on current calculations. That's with 32 chess pieces and their rules.
The number of possible states in an application, especially anything allowing Turing completeness, is far larger than all possible entropy states in the visible universe.
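As a quick back-of-envelope check on just the chess comparison (the 64-byte figure below is an arbitrary, deliberately tiny choice):

```python
# Even a trivially small amount of mutable program state dwarfs chess.
chess_positions = 10 ** 44            # rough upper bound cited above
bytes_of_state = 64                   # a deliberately tiny program state
program_states = 2 ** (8 * bytes_of_state)

print(program_states > chess_positions)   # True
print(len(str(program_states)))           # ~155 digits, vs 45 for chess
```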
It is somewhat complicated by the fact LLMs (and VLMs) are also trained in some cases on more than simple language found on the internet (e.g. code, math, images / videos), but the same insight remains true. The interesting question is to just see how far we can get with (2) anyway.
2. People need to let go of this strange and erroneous idea that humans somehow have this privileged access to the 'real world'. You don't. You run on a heavily filtered, tiny slice of reality. You think you understand electro-magnetism ? Tell that to the birds that innately navigate by sensing the earth's magnetic field. To them, your brain only somewhat models the real world, and evidently quite incompletely. You'll never truly understand electro-magnetism, they might say.
Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language. But, yes, as I said in another comment on this thread, it isn't that simple, because LLMs today are trained on tokens from tokenizers, and these tokenizers are trained on text that includes e.g. natural language, mathematical symbolism, and code.
Yes, humans have incredibly limited access to the real world. But they experience and model this world with far more tools and machinery than language. Sometimes, in certain cases, they attempt to messily translate this messy, multimodal understanding into tokens, and then make those tokens available on the internet.
An LLM (in the sense everyone means it, which, again, is largely a natural language model, but in any case just a tokenized text model) has access only to these messy tokens, so, yes, it has far less capacity than humanity collectively. And though the LLM can integrate knowledge from a massive number of tokens from a huge number of humans, even a single human has more different kinds of sensory information and modality-specific knowledge than the LLM. So humans DO have more privileged access to the real world than LLMs (even though we can barely access a slice of reality at all).
'Language Model' has no inherent meaning beyond 'predicts natural language sequences'. You are trying to make it mean more than that. You can certainly make something you'd call a language model with convolution or LSTMs, but that's just a semantics game. In practice, they would not work like transformers and would in fact perform much worse than them with the same compute budget.
>Even if you disagree with these semantics, the major LLMs today are primarily trained on natural language.
The major LLMs today are trained on trillions of tokens of text, much of which has nothing to do with language beyond being the means of communication, plus millions of images and millions of hours of audio.
The problem as I tried to explain is that you're packing more meaning into 'Language Model' than you should. Being trained on text does not mean all your responses are modelled via language as you seem to imply. Even for a model trained on text, only the first and last few layers of a LLM concerns language.
I see no value in engaging further.
Yeah I'm not the one who doesn't understand the distinction between transformers and other potential LM architectures if your words are anything to go by, but sure, feel free to do whatever you want regardless.
This is irrelevant; the point is that you do have access to a world which LLMs don't, at all. They only get the text we produce after we interact with the world. They are working with "compressed data" at all times, and have absolutely no idea what we subconsciously internalized but decided not to write down, or why.
It doesn't matter whether LLMs have "complete" (nothing does) or human-like world access, but whether the compression in text is lossy in ways that fundamentally prevent useful world modeling or reconstruction. And empirically... it doesn't seem to be. Text contains an enormous amount of implicit structure about how the world works, precisely because humans writing it did interact with the world and encoded those patterns.
And your subconscious is far leakier than you imagine. Your internal state will bleed into your writing, one way or another whether you're aware of it or not. Models can learn to reconstruct arithmetic algorithms given just operation and answer with no instruction. What sort of things have LLMs reconstructed after being trained on trillions of tokens of data ?
You are denouncing a claim that the comment you're replying to did not make.
>(2) language only somewhat models the world
is completely irrelevant.
Everyone is only 'somewhat modeling' the world. Humans, Animals, and LLMs.
Your argument would suggest that because you learned about quantum mechanics through language (textbooks, lectures), you only have access to "humans' modeling of humans' modeling of quantum mechanics" - an infinite regress that's clearly absurd.
This is a deranged and factually and tautologically (definitionally) false claim. LLMs can only work with tokenizations of texts written by people who produce that text to represent their actual models. All this removal and all these intermediate representational steps make LLMs a priori obviously even more distant from reality than humans. This is all definitional; what you are saying is just nonsense.
> When an LLM learns physics from textbooks, scientific papers, and code, it's learning the same compressed representations of reality that humans use, not a "model of a model."
A model is a compressed representation of reality. Physics is a model of the mechanics of various parts of the universe, i.e. "learning physics" is "learning a physical model". So, clarifying, the above sentence is
> When an LLM learns physical models from textbooks, scientific papers, and code, it's learning the model of reality that humans use, not a "model of a model."
This is clearly factually wrong, as the model that humans actually use is not the summaries written in textbooks, but the actual embodied and symbolic model that they use in reality, which they only translate into a corrupted, simplified, limited form as text (and that diminished form of all things is all the LLM can see). It is also not clear the LLM learns to actually do physics: it only learns how to write about physics the way humans do, but that doesn't mean it can run labs, interpret experiments, or apply models to novel contexts like humans can, or operate at the same level as humans. It is clearly learning something different from humans, because it doesn't have the same sources of information.
> Your argument would suggest that because you learned about quantum mechanics through language (textbooks, lectures), you only have access to "humans' modeling of humans' modeling of quantum mechanics" - an infinite regress that's clearly absurd.
There is no infinite regress: humans actually verify that the things they learn and say are correct and provide effects, and update models accordingly. They do this by trying behaviours consistent with the learned model, and seeing how reality (other people, the physical world) responds (in degree and kind). LLMs have no conception of correctness or truth (not in any of the loss functions), and are trained and then done.
Humans can't learn solely from digesting texts either. Anyone who has done math knows that reading a textbook teaches you almost nothing; you have to actually solve the problems (and attempted solving is barely present in any text) and discuss your solutions and reasoning with others. Other domains involving embodied skills, like cooking, require other kinds of feedback from the environment and from others. But LLMs are imprisoned in tokens.
EDIT: No serious researcher thinks LLMs are the way to AGI, this hasn't been a controversial opinion even among enthusiasts since about mid-2025 or so. This stuff about language is all trivial and basic stuff accepted by people in the field, and why things like V-JEPA-2 are being researched. So the comments here attempting to argue otherwise are really quite embarrassing.
Strong words for a weak argument. LLMs are trained on data generated by physical processes (keystrokes, sensors, cameras), not telepathically extracted "mental models." The text itself is the artifact of reality and not just a description of someone's internal state. If a sensor records the temperature and writes it to a log, is the log a "model of a model"? No, it’s a data trace of a physical reality.
>All this removal and all these intermediate representational steps make LLMs a priori obviously even more distant from reality than humans.
You're conflating mediation with distance. A photograph is "mediated" but can capture details invisible to human perception. Your eye mediates photons through biochemical cascades, equally "removed" from raw reality. Proximity isn't measured by steps in a causal chain.
>The model humans use is embodied, not the textbook summaries - LLMs only see the diminished form
You need to stop thinking that a textbook is a "corruption" of some pristine embodied understanding. Most human physics knowledge also comes from text, equations, and symbolic manipulation - not direct embodied experience with quantum fields. A physicist's understanding of QED is symbolic, not embodied. You've never felt a quark.
The "embodied" vs "symbolic" distinction doesn't privilege human learning the way you think. Most abstract human knowledge is also mediated through symbols.
>It's not clear LLMs learn to actually do physics - they just learn to write about it
This is testable and falsifiable - and increasingly falsified. LLMs:
- Solve novel physics problems they've never seen
- Debug code implementing physical simulations
- Derive equations using valid mathematical reasoning
- Make predictions that match experimental results
If they "only learn to write about physics," they shouldn't succeed at these tasks. The fact that they do suggests they've internalized the functional relationships, not just surface-level imitation.
>They can't run labs or interpret experiments like humans
Somewhat true. It's possible but they're not very good at it - but irrelevant to whether they learn physics models. A paralyzed theoretical physicist who's never run a lab still understands physics. The ability to physically manipulate equipment is orthogonal to understanding the mathematical structure of physical law. You're conflating "understanding physics" with "having a body that can do experimental physics" - those aren't the same thing.
>humans actually verify that the things they learn and say are correct and provide effects, and update models accordingly. They do this by trying behaviours consistent with the learned model, and seeing how reality (other people, the physical world) responds (in degree and kind). LLMs have no conception of correctness or truth (not in any of the loss functions), and are trained and then done.
Gradient descent is literally "trying behaviors consistent with the learned model and seeing how reality responds."
- The model makes predictions
- The data provides feedback (the actual next token)
- The model updates based on prediction error
- This repeats billions of times
That's exactly the verify-update loop you describe for humans. The loss function explicitly encodes "correctness" as prediction accuracy against real data.
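For concreteness, here is a minimal sketch of that loop in PyTorch, using a toy bigram predictor over a made-up token stream (everything about the model and data here is a stand-in; only the shape of the predict / compare / update cycle is the point):

```python
import torch
import torch.nn as nn

vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = torch.randint(0, vocab, (1000,))      # stand-in for a real token stream

for step in range(200):
    i = torch.randint(0, len(data) - 1, (64,))
    context, target = data[i], data[i + 1]   # a token, and the token that actually followed
    logits = model(context)                  # the model makes predictions
    loss = loss_fn(logits, target)           # the data provides the feedback
    opt.zero_grad()
    loss.backward()                          # update based on prediction error
    opt.step()                               # ...and repeat
```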
>No serious researcher thinks LLMs are the way to AGI... accepted by people in the field
Appeal to authority, and also overstated. Plenty of researchers do think so, and claiming consensus for your position is just false. LeCun has been on that train for years, so he's not an example of a change of heart. So far, nothing has actually come out of it. Even Meta isn't using V-JEPA to actually do anything, never mind anyone else. Call me when these constructions actually best transformers.
>>This is a deranged and factually and tautologically (definitionally) false claim.
>Strong words for a weak argument. LLMs are trained on data generated by physical processes (keystrokes, sensors, cameras), not telepathically extracted "mental models." The text itself is the artifact of reality and not just a description of someone's internal state. If a sensor records the temperature and writes it to a log, is the log a "model of a model"? No, it’s a data trace of a physical reality.
I don't know how you don't see the fallacy immediately. You're implicitly assuming that all data is factual and that therefore training an LLM on cryptographically random data will create an intelligence that learns properties of the real world. You're conflating a property of the training data and transferring it onto LLMs. If you feed flat earth books into the LLM, you will not be told that earth is a sphere and yet that is what you're claiming here (the flat earth book LLM telling you earth is a sphere). The statement is so illogical that it boggles the mind.
No, that’s a complete strawman. I’m not saying the data is "The Truth TM". I’m saying the data is real physical signal in a lot of cases.
If you train an LLM on cryptographically random data, it learns exactly what is there. It learns that there is no predictable structure. That is a property of that "world." The fact that it doesn't learn physics from noise doesn't mean it isn't modeling the data directly; it just means the data it was given has no physics in it.
>If you feed flat earth books into the LLM, you will not be told that earth is a sphere and yet that is what you're claiming here.
If you feed a human only flat-earth books from birth and isolate them from the horizon, they will also tell you the earth is flat. Does that mean the human isn't "modeling the world"? No, it means their world-model is consistent with the (limited) data they’ve received.
Can you name a few? Demis Hassabis (DeepMind CEO) in his recent interview claims that LLMs will not get us to AGI, Ilya Sutskever also says there is something fundamental missing, same with LeCun obviously, etc.
Jared Kaplan (https://www.youtube.com/watch?v=p8Jx4qvDoSo)
Geoffrey Hinton
come to mind. Just saying, I don't think there's a "consensus of 'serious researchers'" here.
"You're conflating", random totally-psychotic mention of "Gradient descent", way too many other intuitive stylistic giveaways. All transparently low-quality midwit AI slop. Anyone who has used ChatGPT 5.2 with basic or extended thinking will recognize the style of the response above.
This kind of LLM usage seems relevant to someone like @dang, but also I can't prove that the posts I am interacting with are LLM-generated, so, I also feel it isn't worthy of report. Not sure what is right / best to do here.
Also, just wanted you to know that I'm not downvoting you, and have never downvoted you throughout this entire conversation. So take that for what you will.
People do have privileged access to the 'real world' compared to, for example, LLMs and any future AI. It's called consciousness, and it is how we experience and come to know and understand the world. Consciousness is the privileged access that AI will never have.
Ok, explain its mechanism and why it gives privileged access. Furthermore, I'd suggest going for the Nobel prize by describing the elementary mechanics of consciousness and where the state change from non-conscious to conscious occurs. It would be enlightening to read your paper.
Just no, please stop making things up because you feel like it. Trying to say one of the most hotly debated questions in neuroscience has been decided or is even well understood is absolutely insane.
Even then you get into: animals have qualia, right? But theirs are not as expressive as human qualia, which means it is reducible.
I suspect maybe you haven't done much research into this area? Qualia is pretty well established and has been for a long time.
Animals may have qualia, that's true. Though we can only be sure of our own qualia, because that's all we have access to. Qualia are the constituent parts that make up our subjective conscious experience, the atomized subjective experience, like the color red or the taste of something sour.
LLMs being "Language Models" means they model language, it doesn't mean they "model the world with language".
On the contrary, modeling language requires you to also model the world, but that's in the hidden state, and not using language.
LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).
LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.
So, do "LLMS model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But patterns encoded in the hidden state are still patterns of tokens.
Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.
But I know very little about this.
This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.
E.g. if you have a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images, and an LLM encoder backbone pre-trained in the usual LLM way), then even if, for example, you design this architecture so the embedding vectors produced by each encoder have the same shape, to be combined via another component, e.g. some unified transformer, it will not be the case that e.g. the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
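To make the point concrete, here is a small sketch using random projections as stand-ins for two independently trained frozen backbones that merely happen to share an output dimension (nothing here is a real model; it only illustrates why same-shaped vectors from unaligned encoders give uninformative similarities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two frozen backbones with the same output size (512),
# each "trained" (here: drawn) completely independently of the other.
W_vision = rng.normal(size=(2048, 512))   # image features -> embedding
W_text   = rng.normal(size=(768, 512))    # text features  -> embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image_features = rng.normal(size=2048)
text_features  = rng.normal(size=768)

img_emb = image_features @ W_vision
txt_emb = text_features @ W_text

# Near zero regardless of whether the image and text "match": without
# joint training (e.g. contrastive alignment), the number means nothing.
print(cosine(img_emb, txt_emb))
```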
They present a statistical model of an existing corpus of text.
If this existing corpus includes useful information it can regurgitate that.
It cannot, however, synthesize new facts by combining information from this corpus.
The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human-produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order for it to be applied to any scenario. Your typical corpus is also biased towards internet discussions, the English language, and Western prejudices.
The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seem to be able to navigate the mistakes, contradictions, and propaganda with some success, but fail at things like spatial awareness. That's why OpenAI is pushing image models and 3D world models, despite making very little money from them: they are working towards LLMs with more complete world models, unchained by language.
I'm not sure if they are on the right track, but from a theoretical point I don't see an inherent fault
First, the subjectivity of language.
1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.
2) When people produce text, there is always a motive to do so, which influences the contents of the text. This subjective component of producing the text is interpreted no differently from any "world model" information.
A world model should be as objective as possible. Using language, the most subjective form of information, is a bad fit.
The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.
Which companies try to address with image, video, and 3D world capabilities, to add that missing context. "Video generation models as world simulators" is what OpenAI once called it.
> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.
Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do
> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute
The argument is that training neural networks with gradient descent is a universal optimization procedure. It will always try to find weights for the neural network that cause it to produce the "best" results on your training data, within the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture that is capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved with a solution that involves building a world model, and a neural network that is capable of encoding this, and gradient descent will eventually create a world model.
Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that apply to your training data but don't generalize. Just because it should be principally possible doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment seems to have captivated investors
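A minimal sketch of the "basic math" case, assuming nothing beyond stochastic gradient descent on (a, b) -> a + b pairs and a linear layer that is capable of representing addition (all the numbers here are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1, bias=False)            # capable of representing a + b
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(2000):
    ab = torch.rand(256, 2)                    # random training pairs in [0, 1)
    target = ab.sum(dim=1, keepdim=True)       # the "best" answer is simply a + b
    loss = ((model(ab) - target) ** 2).mean()  # how wrong the current weights are
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model.weight.data)                       # -> approximately [[1.0, 1.0]]
print(model(torch.tensor([[3.0, 4.0]])))       # -> approximately 7.0, outside the training range
```

Whether the same dynamic scales from a two-weight adder to a world model is, of course, exactly the open question flagged above.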
Let me illustrate the point using a different argument with the same structure: 1) The best professional chefs are excellent at cutting onions. 2) Therefore, if we train a model to cut onions using gradient descent, that model will be a very good professional chef.
2) clearly does not follow from 1).
Note that humans generate their own incomplete world model. For example, there are sounds and colors we don't hear or see, odors we don't smell, etc. We have an incomplete model of the world, but we still have a model that proves useful for us.
This takes "world model" far too literally. Audio-visual generative AI models that create non-textual "spaces" are not world models in the sense the previous poster meant. I think what they meant by world model is that the vast majority of the knowledge we rely upon to make decisions is tacit, not something that has been digitized, and not something we even know how to meaningfully digitize and model. And even describing it as tacit knowledge falls short; a substantial part of our world model is rooted in our modes of actions, motivations, etc, and not coupled together in simple recursive input -> output chains. There are dimensions to our reality that, before generative AI, didn't see much systematic introspection. Afterall, we're still mired in endless nature v. nurture debates; we have a very poor understanding about ourselves. In particular, we have extremely poor understanding of how we and our constructed social worlds evolve dynamically, and it's that aspect of our behavior that drives the frontier of exploration and discovery.
OTOH, the "world model" contention feels tautological, so I'm not sure how convincing it can be for people on the other side of the debate.
At no point have I seen anyone here ask the question "What is the minimum viable state of a world model?"
We as humans with our ego seem to state that because we are complex, any introspective intelligence must be as complex as us to be as intelligent as us. Which doesn't seem too dissimilar to saying a plane must flap its wings to fly.
I think one could probably argue "yes", to "simple tasks in novel environments". This stuff is super new though.
Note the "Planning" and "Robot Manipulation" parts of V-JEPA 2: https://arxiv.org/pdf/2506.09985:
> Planning: We demonstrate that V-JEPA 2-AC, obtained by post-training V-JEPA 2 with only 62 hours of unlabeled robot manipulation data from the popular Droid dataset, can be deployed in new environments to solve prehensile manipulation tasks using planning with given subgoals. Without training on any additional data from robots in our labs, and without any task-specific training or reward, the model successfully handles prehensile manipulation tasks, such as Grasp and Pick-and-Place with novel objects and in new environments.
That would be like saying studying mathematics can't lead to someone discovering new things in mathematics.
Nothing would ever be "novel" if studying the existing knowledge could not lead to novel solutions.
GPT 5.2 Thinking is solving Erdős Problems that had no prior solution - with a proof.
The point is that the LLM did not model maths to do this; it made calls to a formal proof tool that did model maths, and was essentially working as the step function of a search algorithm, iterating until it found the zero of the function.
That's clever use of the LLM as a component in a search algorithm, but the secret sauce here is not the LLM but the middleware that operated both the LLM and the formal proof tool.
That middleware was the search tool that a human used to find the solution.
This is not the same as a synthesis of information from the corpus of text.
The LLM is not the main component in such a system.
But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.
That sounds highly technical if you ask me. People complain if you recompress music or images with lossy codecs, but when an LLM does that suddenly it's religious?
So LLMs are linguistic (token pattern) models of linguistic models (streams of tokens) describing world models (more than tokens).
It thus does not in fact follow that LLMs model the world (as they are missing everything that is not encoded in linguistic semantics).
Sufficient for what?
For example, no matter how many books you read about riding a bike, you still need to actually get on a bike and do some practice before you can ride it. The reading can certainly help, at least in theory, but, in practice, it is not necessary and may even hurt (if it makes certain processes that need to be unconscious be held too strongly in consciousness, due to the linguistic model presented in the book).
This is why LLMs being so strongly tied to natural language is still an important limitation (even it is clearly less limiting than most expected).
This is like saying that no matter how much you know theoretically about a foreign language you still need to train your brain to talk it. It has little to do with the reality of that language or the correctness of your model of it, but rather with the need to train realtime circuits to do some work.
Let me try some variations: "no matter how many books you read about ancient history, you need to have lived there before you can reasonably talk about it". "No matter how many books you have read about quantum mechanics, you need to be a particle..."
On the contrary, this is purely speculative and almost certainly wrong: riding a bike is coordinating the realtime circuits in the right way, and language and a linguistic model fundamentally cannot get you there.
There are plenty of other domains like this, where semantic reasoning (e.g. unquantified syllogistic reasoning) just doesn't get you anywhere useful. I gave an example from cooking later in this thread.
You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition. Talk to e.g. actual mathematicians, and they will generally tell you they may broadly recruit visualization, imagined tactile and proprioceptive senses, and hard-to-vocalize "intuition". One has to claim this is all epiphenomenal, or that e.g. all unconscious thought is secretly using language, to think that all modeling is fundamentally linguistic (or more broadly, token manipulation). This is not a particularly credible or plausible claim given the ubiquity of cognition across animals or from direct human experiences, so the linguistic boundedness of LLMs is very important and relevant.
> You are falling IMO into exactly the trap of the linguistic reductionist, thinking that language is the be-all and end-all of cognition.
I'm not saying that at all. I am saying that any (sufficiently long, varied) coherent speech needs a world model, so if something produces coherent speech, there must be a world model behind. We can agree that the model is lacking as much as the language productions are incoherent: which is very little, these days.
This is circular, because you are assuming their world-model of biking can be expressed in language. It can't!
EDIT: There are plenty of skilled experts, artists, etc. who clearly and obviously have complex world models that let them produce best-in-the-world outputs, but who can't express very precisely how they do it. I would never claim such people have no world model or understanding of what they do. Perhaps we have a semantic / definitional issue here?
Ok. So I think I get it. For me, producing coherent discourse about things requires a world model, because you can't just make up coherent relationships between objects and actions long enough if you don't understand what their properties are and how they relate to each other.
You, on the other hand, claim that there are infinite firsthand sensory experiences (maybe we can call them qualia?) that fall in between the cracks of language and are rarely communicated (though we use for that a wealth of metaphors and synesthesia) and can only be understood by those who have experienced them firsthand.
I can agree with that if that's what you mean, but at the same time I'm not sure they constitute such a big part of our thought and communication. For example, we are discussing reality in this thread, and yet there are no necessary references to firsthand experiences. Any time we talk about history, physics, space, maths, or philosophy, we're basically juggling concepts in our heads with zero direct experience of them.
Well, not infinite, but, yes! I am indeed claiming that much of our world models are patterns and associations between qualia, and that only some qualia are essentially representable as, or look like, linguistic tokens (specifically, the sounds of those tokens being pronounced, or their visual shapes, e.g. math symbols). I.e. I am claiming that the way one learns to, say, cook, or "do theoretical math", may be more about forming associations between those non-linguistic qualia than, obviously, doing philosophy is.
> I'm not sure they constitute such a big part of our thought and communication
The communication part is mostly tautological again, but, yes, it remains very much an open question in cognitive science just how exactly thought works. A lot of mathematicians claim to lean heavily on visualization and/or tactile and kinaesthetic modeling for their intuitions (and most deep math is driven by intuition first), but also a lot of mathematicians can produce similar works and disagree about how they think about it intuitively. And we are seeing some progress from e.g. Aristotle using LEAN to generate math proofs in a strictly tokenized / symbolic way, but it remains to be seen if this will ever produce anything truly impressive to mathematicians. So it is really hard to know what actually matters for general human cognition.
I think introspection makes it clear there are a LOT of domains where it is obvious the core knowledge is not mostly linguistic. This is easiest to argue for embodied domains and skills (e.g. anything that requires direct physical interaction with the world), and it is areas like these (e.g. self-driving vehicle AI) where LLMs will be (most likely) least useful in isolation, IMO.
So you can't build an AI model that simulates riding a bike? I'm not talking about an LLM, I'm just talking about the kind of AI simulation we've been building virtual worlds with for decades.
So, now that you agree we can build AI models of simulations, what are those AI models doing? Are they using a binary language that can be summarized?
Calling everything "language" is not some gotcha, the middle "L" in LLM means natural language. Binary code is not "language" in this sense, and these terms matter. Robotics AIs are not LLMs, they are just AI.
Any series of self-consistent encoded signals can be a language. You could feed an LLM wireless signals until it learned how to connect to your wifi, if you wanted to. Just assign tokens. You're acting like words are something different from encoded information. It's the interconnectivity between those bits of data that matters.
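A minimal sketch of what "just assign tokens" could mean in practice: quantize a continuous signal into a small discrete vocabulary so a sequence model can consume it. Everything here (the fake RSSI readings, the bin count, the token names) is invented for illustration.

```python
import numpy as np

# Toy version of "just assign tokens": map an arbitrary continuous signal
# (here, fake wireless signal-strength readings) onto a discrete vocabulary.
rng = np.random.default_rng(0)
signal = rng.normal(loc=-60.0, scale=8.0, size=32)        # pretend RSSI samples in dBm

# Quantize into 16 bins; each bin id becomes a "token".
edges = np.linspace(signal.min(), signal.max(), num=17)   # 17 edges -> 16 bins
token_ids = np.clip(np.digitize(signal, edges) - 1, 0, 15)

tokens = [f"<rssi_{i}>" for i in token_ids]                # invented token names
print(tokens[:8])
```

Whether a token stream like this has the kind of long-range structure that makes LLM-style pretraining worthwhile is, of course, exactly what the rest of this thread is arguing about.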
https://news.ycombinator.com/item?id=46948266
This works against your argument. Someone who can ride a bike clearly knows how to ride a bike; that they cannot express it in tokenized form speaks to the limited scope of the written word in representing embodiment.
Nvidia is building LLM-driven models that work with robot models simulating robot actions. World simulation was a huge part of AI before LLMs became a thing. With a tight coupling between LLMs and robot models, we've seen an explosion in robot capabilities in the last few years.
You know what robots communicate with their actuators and sensors in? Binary data. We quite commonly call those words. When you have a set of actions that simulates riding a bicycle in virtual space, it can be summarized and described. Who knows if humans can actually read/understand what the model spits out, but that doesn't mean it's invalid.
It would be more precise to say that complex world modeling is not done with LLMs, or that LLMs only supplement those world models. Robotics models are AI, calling them LLMs is incorrect (though they may use them internally in places).
The middle "L" in LLM refers to natural language. Calling everything language and words is not some gotcha, and sensor data is nothing like natural language. There are multiple streams / channels, where language is single-stream; sensor data is continuous and must not be tokenized; there are not long-term dependencies within and across streams in the same way that there are in language (tokens thousands of tokens back are often relevant, but sensor data from more than about a second ago is always irrelevant if we are talking about riding a bike), making self-attention expensive and less obviously useful; outputs are multi-channel and must be continuous and realtime, and it isn't even clear the recursive approach of LLMs could work here.
Another good example of world models informed by work in robotics is V-JEPA 2.
https://ai.meta.com/research/vjepa/
https://arxiv.org/abs/2506.09985
Once you understand that, you realize that the human brain has an internal model of almost everything it interacts with, and that replicating human-level performance requires the entire human brain, not just isolated parts of it. The reason is that, since we take our brains for granted, we use even the complicated and hard-to-replicate parts of the brain for tasks that appear trivial.
When I take out the trash, organic waste needs to be thrown into the trash bin without the plastic bag. I need to untie the trash bag, pinch it from the other side and then shake it until the bag is empty. You might say big deal, but when you have tea bags or potato peels inside, they get caught on the bag handles and get stuck. You now need to shake the bag in very particular ways to dislodge the waste. Doing this with a humanoid robot is basically impossible, because you would need to model every scrap of waste inside the plastic bag. The much smarter way is to make the situation robot friendly by having the robot carry the organic waste inside a portable plastic bin without handles.
Every single time I travel somewhere new, whatever research I did, whatever reviews or blogs I read or whatever videos I watched become totally meaningless the moment I get there. Because that sliver of knowledge is simply nothing compared to the reality of the place.
Everything you read is through the interpretation of another person. Certainly someone who read a lot of books about ancient history can talk about it - but let's not pretend they have any idea what it was actually like to live there.
For instance, say you are a woman with a lookalike celebrity, someone who is a very close match in hair colour, facial structure, skin tone and body proportions. You would like to browse outfits (presumably put together by professional stylists) worn by other celebrities who look exactly like her. You ask an LLM to list celebrities that look like celebrity X, so you can then look up outfit inspiration.
No matter how long the list, no matter how detailed the prompt in the features that must be matched, no matter how many rounds you do, the results will be completely unusable, because broad language dominates more specific language in the corpus.
The LLM cannot adequately model these facets, because language is in practice too imprecise, as currently used by people.
To dissect just one such facet: the LLM response will list dozens of people who may share a broad category (red hair), with complete disregard for the exact shade of red, whether or not the hair is dyed, and whether or not it is indeed natural hair or a wig.
The number of listicles clustering these actresses together as redheads will dominate anything with more specific qualifiers, like ’strawberry blonde’ (which in general counts as red hair), ’undyed hair’ (which in fact tends to increase the proportion of dyed hair results, because that’s how linguistic vector similarity works sometimes) and ’natural’ (which again seems to translate into ’the most natural looking unnatural’, because that’s how language tends to be used).
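If you want to see that vector-similarity effect directly, here is a rough sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (neither is mentioned in the comment, and exact numbers will vary by model): negated or qualified phrases tend to land very close to their opposites in embedding space.

```python
from sentence_transformers import SentenceTransformer, util

# Compare how close "undyed" sits to "dyed" in a generic sentence-embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = [
    "actress with natural undyed red hair",
    "actress with dyed red hair",
    "actress with strawberry blonde hair",
]
emb = model.encode(phrases, convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]))  # typically high, despite the contradiction
print(util.cos_sim(emb[0], emb[2]))
```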
In practice it would make heavy use of RL, as humans do.
Oh, so you mean, it would be in a harness of some sort that lets it connect to sensors that tell it things about its position, speed, balance and etc? Well, yes, but then it isn't an LLM anymore, because it has more than language to model things!
Not sure plain text alone is great for that; it would be better if it could understand the position internally somehow.
In particular, sensor data doesn't have the same semantics or structure at all as language (it is continuous data and should not be tokenized; it will be multi-channel, i.e. have multiple streams, whereas text is single-channel; outputs need to be multi-channel as well, and realtime, so it is unclear if the LLM recursive approach can work at all or is appropriate). The lack of contextuality / interdependency, both within and between these streams might even mean that e.g. self-attention is not that helpful and just computationally wasteful here. E.g. what was said thousands of tokens ago can be completely relevant and change the meaning of tokens being generated now, but any bike sensor data from more than about a second or so ago is completely irrelevant to all future needed outputs.
Sure, maybe a transformer might still do well processing this data, but an LLM literally can't. It would require significant architectural changes just to be able to accept the inputs and make the outputs.
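To make the mismatch concrete, here is a hedged sketch of what the bike-balancing I/O actually looks like: several continuous channels in, continuous torques out, at a fixed rate, with only a short history mattering. All names (BikeSensors, read_sensors, apply_controls) and gains are hypothetical.

```python
import collections
from dataclasses import dataclass

@dataclass
class BikeSensors:          # multi-channel, continuous-valued, sampled at ~100 Hz
    lean_angle: float       # rad
    lean_rate: float        # rad/s
    steer_angle: float      # rad
    speed: float            # m/s

def control_step(history: collections.deque) -> tuple[float, float]:
    """Map the last ~1 s of sensor history to (steer_torque, pedal_torque).

    Contrast with an LLM: inputs are parallel real-valued streams, outputs are
    continuous and realtime, and anything older than this short window is
    irrelevant, so long-context attention buys little here.
    """
    s = history[-1]
    steer_torque = -8.0 * s.lean_angle - 1.5 * s.lean_rate  # toy stabilizing feedback
    pedal_torque = 2.0 * max(0.0, 3.0 - s.speed)            # hold roughly 3 m/s
    return steer_torque, pedal_torque

history: collections.deque = collections.deque(maxlen=100)  # ~1 s of history at 100 Hz
# while True:                                   # hypothetical realtime loop
#     history.append(read_sensors())            # hypothetical hardware interface
#     apply_controls(*control_step(history))    # hypothetical actuator interface
```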
You know transformers can do math, right?
You're asking someone to answer this question in a text forum. This is not quite the gotcha you think it is.
The distinction between "knowing" and "putting into language" is a rich source of epistemological debate going back to Plato and is still widely regarded to represent a particularly difficult philosophical conundrum. I don't see how you can make this claim with so much certainty.
Riding a bike is, broadly, learning to co-ordinate your muscles in response to visual data from your surroundings and signals from your vestibular and tactile systems that give you data about your movement, orientation, speed, and control. As LLMs only output tokens that represent text, by definition they can NEVER learn to ride a bike.
Even ignoring that glaring definitional issue, an LLM also can't learn to ride a bike from books written by humans to humans, because an LLM could only operate through a machine using e.g. pistons and gears to manipulate the pedals. That system would be controlled by physics and mechanisms different from humans, and not have the same sensory information, so almost no human-written information about (human) bike-riding would be useful or relevant for this machine to learn how to bike. It'd just have to do reinforcement learning with some appropriate rewards and punishments for balance, speed, and falling.
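A minimal sketch of what that RL setup could look like, with a hypothetical gym-style BikeEnv and a reward shaped around balance, forward speed, and falling; none of these names or numbers come from the comment, and the dynamics are a placeholder.

```python
import numpy as np

class BikeEnv:
    """Hypothetical bike-riding environment with a gym-style reset/step API."""

    def reset(self):
        return np.zeros(4)                     # [lean_angle, lean_rate, steer_angle, speed]

    def step(self, action):
        next_state = np.random.randn(4) * 0.1  # placeholder dynamics, not real physics
        fell_over = abs(next_state[0]) > 0.8
        # Reward: stay upright, keep moving, large penalty for falling.
        reward = 1.0 - abs(next_state[0]) + 0.1 * next_state[3]
        if fell_over:
            reward -= 10.0
        return next_state, reward, fell_over

def train(policy, episodes=1000):
    env = BikeEnv()
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)                        # any RL method you like
            next_state, reward, done = env.step(action)
            policy.update(state, action, reward, next_state)  # e.g. Q-learning, policy gradient
            state = next_state
```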
And if we could embody AI in a sensory system so similar to the human sensory system that it becomes plausible text on bike-riding might actually be useful to the AI, it might also be that, for exactly the same reasons, the AI learns just as well to ride just by hopping on the thing, and that the textual content is as useless to it as it is for us.
Thinking this is an obvious gotcha (or the later comment that anyone thinking otherwise is going to have egg on their face) is just embarrassing. Much more of a wordcel problem than I would have expected on HN.
You could claim the same about a human: your brain alone can't control the bike either, because it has no access to the muscles as a "tool".
You need to compare similar stuff
No, it literally cannot. See the reasons I give elsewhere here: https://news.ycombinator.com/item?id=46948266.
A more general AI trained with reinforcement learning and a novel architecture could surely learn to ride a bike, but an LLM simply cannot.
Related: have you seen Nvidia with their simulated 3D environments? That might not be called an LLM, but it's not very far from what our LLMs actually do right now. It's just a naming difference.
It synthesizes comments on “RL Environments” (https://ankitmaloo.com/rl-env/), “World Models” (https://ankitmaloo.com/world-models/) and the real reason that the “Google Game Arena” (https://blog.google/innovation-and-ai/models-and-research/go...) is so important to powering LLMs. In a sense it also relates to the notion of “taste” (https://wangcong.org/2026-01-13-personal-taste-is-the-moat.h...) and how / whether its moat-worthiness can be eliminated by models.
For a screen (coding, writing emails, updating docs) -> you can create world models with episodic memories that can be used as background context before making a new move (action). Many professions rely partially on email or phone (voice), so LLMs can be trained for world models in these contexts. Just not every context.
The key is giving episodic memory to agents with visual context about the screen and conversation context. Multiple episodes of similar context can be used to make the next move. That's what I'm building on.
All the additional information in the world isn't going to help an LLM-based AI conceal its poker-betting strategy, because it fundamentally has no concept of its adversarial opponent's mind, past echoes written in word form.
Cliche allegory of the cave, but LLM vs world is about switching from training artificial intelligence on shadows to the objects casting the shadows.
Sure, you have more data on shadows in trainable form, but it's an open question on whether you can reliably materialize a useful concept of the object from enough shadows. (Likely yes for some problems, no for others)
There are fundamentally different types of games that map to real world problems. See: https://en.wikipedia.org/wiki/Game_theory#Different_types_of...
The hypothesis from the post, to boil it down, is that LLMs are successful in some of these but architecturally ill-suited for others.
It's not about what or how an LLM knows, but the epistemological basis for its intelligence and what set of problems that can cover.
It is an awfully weak signal to pick up in data.
With all the social science research and strategy books that LLMs have read, they actually know a LOT about outcomes and dynamics in adversarial situations.
The author does have a point though that LLMs can’t learn these from their human-in-the-loop reinforcement (which is too controlled or simplified to be meaningful).
Also, I suspect the _word_ models of LLMs are not inherently the problem, they are just inefficient representations of world models.
The articles will not be mutually consistent, and what output the LLM produces will therefore depend on what article the prompt most resembles in vector space and which numbers the RNG happens to produce on any particular prompt.
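For the RNG part specifically, here is a toy sketch of temperature sampling over next-token logits: same prompt, same logits, different seeds, different continuations. The logits and the three candidate "tokens" are invented.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, seed=None):
    """Toy temperature sampling: identical logits, different seeds, different picks."""
    rng = np.random.default_rng(seed)
    probs = np.exp(np.asarray(logits) / temperature)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Invented logits for three candidate next tokens: "Yes", "No", "It (depends)".
logits = [2.1, 1.9, 0.4]
print([sample_next_token(logits, seed=s) for s in range(5)])  # the pick varies with the seed
```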
I don’t think essentialist explanations of how LLMs work are very helpful. They don’t give any meaningful account of the high-level nature of the pattern matching that LLMs are capable of. And they draw a dichotomous line between basic pattern matching and knowledge and reasoning, when it is much more complex than that.
I don't think this is right: To calculate the optimal path, you do need to model human cognition.
At least, in the sense that finding the best path requires figuring out human concepts like "is the king vulnerable", "material value", "rook activity", etc. We have actual evidence of AlphaZero calculating those things in a way that is at least somewhat like humans do:
https://arxiv.org/abs/2111.09259
So even chess has "hidden state" in a significant sense: you can't play well without calculating those values, which are far from the surface.
I'm not sure there's a clear line between chess and poker like the author assumes.
What I think you are referring to is hidden state in the sense of internal representations. I mean hidden state in the game-theoretic sense: private information that only one party has. I think we both agree AlphaZero has hidden state in the first sense.
Concepts like king safety are objectively useful for winning at chess, so AlphaZero developed them too; no surprise there. Great example of convergence. However, AlphaZero did not need to know what I am thinking or how I play in order to beat me. In poker, you must model a player's private cards and beliefs.
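A toy sketch of what "modeling private cards and beliefs" means in code: a Bayes update over the opponent's hidden hand, under an assumed (entirely made-up) betting strategy. Chess has no analogous private state to track.

```python
# Toy opponent model: the hidden state is the opponent's hand strength, and we
# update a belief over it from their observed action. The betting probabilities
# below are invented; estimating them is itself part of modeling the opponent.
hands = ["weak", "medium", "strong"]
prior = {h: 1 / 3 for h in hands}

p_raise_given_hand = {"weak": 0.10, "medium": 0.35, "strong": 0.85}  # assumed strategy

def update_belief(belief, raised):
    likelihood = {h: p_raise_given_hand[h] if raised else 1 - p_raise_given_hand[h]
                  for h in hands}
    unnormalized = {h: belief[h] * likelihood[h] for h in hands}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

print(update_belief(prior, raised=True))   # probability mass shifts toward "strong"
```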
> UPD September 15, 2025: Reasoning models opened a new chapter in Chess performance, the most recent models, such as GPT-5, can play reasonable chess, even beating an average chess.com player.
It’s a limitation LLMs will have for some time. Since the game is multi-turn with long-range consequences, the only way to truly learn and play “the game” is to experience significant amounts of it. Embody an adversarial lawyer, or a software engineer trying to get projects through a giant org..
My suspicion is agents can’t play as equals until they start to act as full participants - very sci fi indeed..
Putting non-humans into the game can’t help but change it in new ways - people already decry slop and that’s only humans acting in subordination to agents. Full agents - with all the uncertainty about intentions - will turn skepticism up to 11.
“Who’s playing at what” is and always was a social phenomenon, much larger than any multi turn interaction, so adding non-human agents looks like today’s game, just intensified. There are ever-evolving ways to prove your intentions & human-ness and that will remain true. Those who don’t keep up will continue to risk getting tricked - for example by scammers using deepfakes. But the evolution will speed up and the protocols to become trustworthy get more complex..
Except in cultures where getting wasted is part of doing business. AI will have it tough there :)
And then "Large Language Models are Few Shot Learners" collided with Sam Altman's ambition/unscrupulousness and now TensorRT-LLM is dictating the shape of data centers in a self reinforcing loop.
LLMs are interesting and useful but the tail is wagging the dog because of path-dependent corruption arbitraging a fragile governance model. You can get a model trained on text corpora to balance nested delimiters via paged attention if you're willing to sell enough bonds, but you could also just do the parse with a PDA from the 60s and use the FLOPs for something useful.
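For reference, the "PDA from the 60s" point is literally this: checking balanced nested delimiters needs nothing more than a stack. A minimal sketch:

```python
def delimiters_balanced(text: str) -> bool:
    """Check nested (), [], {} with a plain stack, i.e. the pushdown-automaton idea."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack

assert delimiters_balanced("f(a[0], {k: (v)})")
assert not delimiters_balanced("f(a[0)]")
```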
We had it right: dial in an ever-growing set of tasks, opportunistically unify on durable generalities, put in the work.
Instead we asserted generality, lied about the numbers, and lit a trillion dollars on fire.
We've clearly got new capabilities, it's not a total write-off, but God damn was this an expensive way to spend five years making two years of progress.
Not all bad then. Hopefully this impedes them getting more.
I believe the author addresses it in the article:
> many domains are chess-like in their technical core but become poker-like in their operational context.
Also applicable to programming.
Programming has chess-like parts that are bounded, and which are what people assume the actual work to be. However, what LLMs don't do well is understanding future requirements, stakeholder incentives, etc.
In the work environment the optimal strategy will be parameterised culturally.
Companies have different cultures - both at the company level and at the country level.
In some places self-interest is the accepted driving force, in others if you behave like that you'll find yourself quietly ostracised.
For example, I'm not sure Trump understands this.
The Slack-message result, for example, when I say only that they're the lead designer and nothing else (the feedback is clearer and better if I add that they're notoriously overloaded; this is without that):
This is a very polite and respectful start, which is great since you are new. However, from a professional standpoint, it is a little too passive and vague. In a busy work environment, saying "no rush at all" or "whenever" often leads to your request being buried at the bottom of a to-do list. Additionally, you haven't told Priya exactly what she is looking at or where to find it. Here is a breakdown of how to strengthen this message to show you are organized and respectful of her time without being pushy.

Key Improvements Needed
* Include the Link: Never make someone ask "Where are the files?" Always drop the Figma/Drive link immediately.
* Be Specific: "Any feedback" is mentally taxing. Direct her attention to what you are actually unsure about (e.g., "the color palette," "the user flow," "the spacing").
* Soft Deadline: Instead of "whenever," give a target. This helps her prioritize her week.

Option 1: The Balanced Approach (Recommended)
This version strikes the perfect balance between politeness and productivity.
> "Hi Priya! I’ve put together the initial mockups for [Project Name].
> [Link to file]
> When you have a moment, I’d love your feedback on [specific element, e.g., the layout of the landing page]. I’m hoping to wrap this stage up by [Day/Time], but let me know what works for your schedule. Thanks!"

Option 2: The "Respectful of Time" Approach
Use this if you know she is swamped, but you still want to make progress.
> "Hi Priya, hope you're having a good week. I've drafted the mockups for [Project Name] here: [Link].
> No major rush, but I’d appreciate your eyes on the [specific section] before I move on to high-fidelity. If it’s easier, feel free to leave comments directly in the file rather than messaging back. Thanks!"

Option 3: The "Quick Check" Approach
Use this if you just need a sanity check before spending more time on it.
> "Hi Priya, could I get a quick gut check on these mockups? [Link].
> I want to make sure I’m aligning with the new brand guidelines before I build out the rest of the screens. Does this direction look right to you?"

A Quick Checklist Before You Send
* [ ] Did you change the permissions? Make sure the link is accessible (view/comment access) so she doesn't have to request access.
* [ ] Is the file clean? Delete your "scratchpad" artboards or clearly label the one you want her to review so she doesn't look at the wrong version.

Would you like me to help you draft the specific sentence regarding the "specific element" you want her to critique?
> Humans can model the LLM. The LLM can’t model being modeled
Can’t they? Why not?
Also, Priya is in the same "world" as everyone else. She has the context that the new person is three weeks in, probably needs some help because they're new, is actually reaching out, and that impressions matter, even if they said "not urgent". "Not urgent" is seldom taken at face value. It doesn't necessarily mean it's urgent, but it means "I need help, but I'm being polite".
DevOps engineers who acted like the memes changed everything! The cloud will save us!
Until recently the US was quite religious; 80%+ around 2000, down to the 60s now. Longtermist dogma of one kind or another rules those brains: endless growth in economics, longtermism. Those ideals are baked into biochemical loops regardless of the semantics the body may express them in.
Unfortunately for all the disciples, time is not linear. No center to the universe means no single epoch to measure from. Humans have different birthdays and are influenced by information along different timelines.
A whole lot of brains are struggling with the realization that they bought into a meme and physics never really cared about their goals. The next generation isn't going to just pick up the meme-baton and validate the elders' dogma.
Computing has nothing to do with the machine.
The first application of the term "computer" was to humans doing math with an abacus and slide rule.
Turing machines and bits are not the only viable model. That little in-between generation only knows a tiny bit about "computing", using the machines that IBM, Apple, Intel, etc. propagandized them into buying. All computing must fit our model machine!
Different semantics but same idea as my point about DevOps.
Everyone wants star trek, but we're all gunna get star wars lol.
Can we get to another level without a corresponding massive training set that demonstrates those abilities?
All of our theorem provers had no way to approach silver medal performance despite decades of algorithmic leaps.
The training stage for transformers demonstrated, a while ago, some insanely good distributed jumps into good regions of combinatorial structures. Inference is simply much faster than inference with algorithms that aren't heavily informed by data.
It's just a fully different distributed algorithm, where we probably can't even extract one working piece without breaking the performance of the whole.
World model vs. word model is just not the right frame there. Gradient descent obviously landed on a distributed representation of an algorithm that does search.
But I do think that there is further underlying structure that can be exploited. A lot of recent work on geometric and latent interpretations of reasoning, on geometric approaches to accelerating grokking, and on linear replacements for attention points in promising directions, and multimodal training will further improve semantic synthesis.
E.g. syllogistic arguments based on linguistic semantics can lead you deeply astray if those arguments don't properly measure and quantify at each step.
I ran into this in a somewhat trivial case recently, trying to get ChatGPT to tell me if washing mushrooms ever really actually matters practically in cooking (anyone who cooks and has tested knows, in fact, a quick wash has basically no impact ever for any conceivable cooking method, except if you wash e.g. after cutting and are immediately serving them raw).
Until I forced it to cite respectable sources, it just repeated the usual (false) advice about not washing (i.e. most of the training data is wrong and repeats a myth), and it even gave absolute nonsense arguments about water percentages and thermal energy required for evaporating even small amounts of surface water as pushback (i.e. using theory that just isn't relevant when you actually properly quantify). It also made up stuff about surface moisture interfering with breading (when all competent breading has a dredging step that actually won't work if the surface is bone dry anyway...), and only after a lot of prompts and demands to only make claims supported by reputable sources, did it finally find McGee's and Kenji Lopez's actual empirical tests showing that it just doesn't matter practically.
So because the training data is utterly polluted for cooking, and since it has no ACTUAL understanding or model of how things in cooking actually work, and since physics and chemistry are actually not very useful when it comes to the messy reality of cooking, LLMs really fail quite horribly at producing useful info for cooking.
People won’t even admit their sexual desires to themselves and yet they keep shaping the world. Can ChatGPT access that information somehow?
Or at least, this is the case if we mean LLM in the classic sense, where the "language" in the middle L refers to natural language. Also note GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video, starts to look much closer to the majority of the kinds of inputs humans learn from. LLMs can't go too far, for sure, but VLMs could conceivably go much, much farther.
So right now the limitation is that an LMM is probably not trained on any images or audio that is going to be helpful for stuff outside specific tasks. E.g. I'm sure years of recorded customer service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of e.g. unedited videos of people cooking is going to mean that LLMs just fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail here).
But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.
Absolutely. There is only one model that can consistently produce novel sentences that aren't absurd, and that is a world model.
> People won’t even admit their sexual desires to themselves and yet they keep shaping the world
How do you know about other people's sexual desires then, if not through language? (excluding a very limited first hand experience)
Sure. Just like any other information. The system makes a prediction. If the prediction does not use sexual desires as a factor, it's more likely to be wrong. Backpropagation deals with it.
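A toy version of that argument: a predictor that ignores a factor present in the data has higher error, and plain gradient descent will assign that factor weight whether or not anyone "admits" it. The data and coefficients here are synthetic.

```python
import numpy as np

# Synthetic data: the outcome depends on an "unadmitted" factor x2 as well as x1.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = 1.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=2000)

w = np.zeros(2)
for _ in range(500):                        # plain gradient descent on squared error
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

print(w)  # ends up near [1.5, 2.0]: the hidden factor gets encoded because
          # ignoring it leaves prediction error on the table
```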
Math is language, and we've modelled a lot of the universe with math. I think there's still a lot of synthesis needed to bridge visual, auditory and linguistic modalities though.
I've found ChatGPT, especially "5.2 Thinking", to be very helpful in the relatively static world of fabrication. CNC cutting parameters for a new material? Gets me right in the neighborhood in minutes (not perfect, but good parameters to start from). Identifying materials to complement something I have to work with? Again, like a smart assistant. Same for generating lists of items I might be missing in prepping for a meeting or proposal.
But the high-level attorney in the family? Awful, and definitely in the ways identified (the biglaw firm is using MS derivative of OpenAI) - it thinks only statically.
BUT, it is also far worse than that for legal. And this is not a problem of dynamic vs. static or world model vs word model.
This problem is the ancient rule of Garbage In Garbage Out.
In any legal specialty there are a small set of top-level experts, and a horde of low-level pretenders who also hang out their shingle in the same field. Worse yet, the pretenders write a LOT of articles about the field to market themselves as experts. These self-published documents look good enough to non-lawyers to bring in business. But they are often deeply and even catastrophically wrong.
The problem is that LLMs ingest ALL of them with credulity; LLMs cannot, or do not, tell the difference. So, when an LLM composes something, it is more likely to lie to you or fabricate some weird triangulation than it is to compose a good answer. And, unless you are an EXPERT lawyer, you will not be able to tell the difference until it is far too late and the flaw has already bitten you.
It is only one of the problems, and it's great to have an article that so clearly identifies it.
https://news.ycombinator.com/newsguidelines.html
A more "Eastern" perspective might recognize that much deep knowledge cannot be encoded linguistically ("The Tao that can be spoken is not the eternal Tao", etc.), and there is more broad recognition of the importance of unconscious processes and change (or at least more skepticism of the conscious mind). Freud was the first real major challenge to some of this stuff in the West, but nowadays it is more common than not for people to dismiss the idea that unconscious stuff might be far more important than the small amount of things we happen to notice in the conscious mind.
The (obviously false) assumptions about the importance of conscious linguistic modeling are what lead people to say (obviously false) things like "How do you know your thinking isn't actually just like LLM reasoning?".
Regarding conscious vs non-conscious processes:
Inference is actually a non-conscious process, because nothing is observed by the model.
Autoregression is a conscious process, because the model observes its own output, i.e. it has self-referential access.
I.e. models use both, and the early/mid layers perform highly abstracted non-conscious processing.
Otherwise, I don't understand the way you are using "conscious" and "unconscious" here.
My main point about conscious reasoning is that when we introspect to try to understand our thinking, we tend to see e.g. linguistic, imagistic, tactile, and various sensory processes / representations. Some people focus only on the linguistic parts and downplay e.g. imagery ("wordcels vs. shape rotators meme"), but in either case, it is a common mistake to think the most important parts of thinking must always necessarily be (1) linguistic, (2) are clearly related to what appears during introspection.
Your first comment was referring to the unconscious; now you don't mention it.
Regarding "conscious and linguistic", which you seem to be touching on now, and setting multimodality aside: text itself is way richer for LLMs than for humans. Trivial examples are e.g. a Mermaid diagram that describes some complex topology, an SVG that describes some complex vector graphic, or a complex program or web application. All are textual, but to understand and create them a model must operate in non-linguistic domains.
Even pure text-to-text models have the ability to operate in domains other than the linguistic; but these models are not text-to-text only anyway, as they can ingest images directly as well.
Everything you said about how data flows in these multimodal models is not true in general (see https://huggingface.co/blog/vlms-2025), and unless you happen to work for OpenAI or other frontier AI companies, you don't know for sure how they are corralling data either.
Companies will of course engage in marketing and claim e.g. ChatGPT is a single "model", but, architecturally and in practice, this at least is known not to be accurate. The modalities and backbones in general remain quite separate, both architecturally and in terms of pre-training approaches. You are talking at a high level of abstraction that suggests education from blog posts by non-experts: actually read papers on how the architectures of these multimodal models are actually trained, developed, and connected, and you'll see the multi-modality is still very limited.
Also, and most importantly, the integration of modalities is primarily of the form
    image -> linguistic description (tokens) -> reasoning in language -> (maybe) back to images
and not of the form
    image -> visuospatial representation -> reasoning in that representation -> output.
I.e. most multimodal work uses linguistic models to represent or describe images linguistically, in the hope that the linguistic parts do the majority of the thinking and processing; there is not much work using the image or video representations themselves to do the thinking. You "convert away" from most modalities into language, do the work with token representations, and then maybe go back to images. There isn't much work on using visuospatial world models or representations for the actual work (though there is some very cutting-edge work here, e.g. Sam-3D https://ai.meta.com/blog/sam-3d/ and V-JEPA-2 https://ai.meta.com/research/vjepa/). But precisely because this stuff is cutting edge, even at frontier AI companies, it is likely that most of the LLM behaviour you see is largely driven by what was learned from language, not from images or other modalities. So LLMs are indeed still mostly constrained by their linguistic core.
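To make the "convert away into language" pattern concrete, here is a hedged structural sketch (random weights, invented dimensions, not any particular production system) of the common VLM wiring: a vision encoder produces patch features, a small projector maps them into the language model's token-embedding space, and all downstream processing is the ordinary LLM.

```python
import torch
import torch.nn as nn

# Structural sketch only: image -> patch features -> projector -> pseudo-"tokens" -> LLM.
vision_dim, llm_dim, vocab = 768, 1024, 32000

vision_encoder = nn.Sequential(              # stand-in for a ViT: image to patch features
    nn.Conv2d(3, vision_dim, kernel_size=16, stride=16),
    nn.Flatten(2),                           # (batch, vision_dim, num_patches)
)
projector = nn.Linear(vision_dim, llm_dim)   # maps patch features into the LLM embedding space
text_embed = nn.Embedding(vocab, llm_dim)
llm_backbone = nn.TransformerEncoder(        # stand-in for the decoder-only language model
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True), num_layers=2
)

image = torch.randn(1, 3, 224, 224)
patches = vision_encoder(image).transpose(1, 2)    # (1, 196, vision_dim)
image_tokens = projector(patches)                  # image features now live in "token space"
text_tokens = text_embed(torch.randint(0, vocab, (1, 16)))

sequence = torch.cat([image_tokens, text_tokens], dim=1)
out = llm_backbone(sequence)                       # all further "thinking" happens in the LLM
print(out.shape)                                   # (1, 196 + 16, llm_dim)
```

The point of the sketch is only that the visual side typically ends once the features are projected into token space; whatever reasoning happens afterwards is done by the language backbone.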
I am not really fond of us "westerners", but judging by how many "easterners" treat their populace, they seem to confirm the point.
What do you mean by sweep training here?
Major error. The LLM made that text without evaluating it at all. It just parroted words it previously saw humans use in superficially similar word contexts.
The mistake is treating “model” as a single property, instead of separating cognition from decision.
LLMs clearly do more than surface-level word association. They encode stable relational structure: entities, roles, temporal order, causal regularities, social dynamics, counterfactuals. Language itself is a compressed record of world structure, and models trained on enough of it inevitably internalize a lot of that structure. Calling this “just a word model” undersells what’s actually happening internally.
At the same time, critics are right that these systems lack autonomous grounding. They don’t perceive, act, or test hypotheses against reality on their own. Corrections come from training data, tools, or humans. Treating their internal coherence as if it were direct access to reality is a category error.
But here’s the part both sides usually miss: the real risk isn’t representational depth, it’s authority.
There’s a difference between:
cognition: exploring possibilities, tracking constraints, simulating implications, holding multiple interpretations; and
decision: collapsing that space into a single claim about what is, what matters, or what someone thinks.
LLMs are quite good at the first. They are not inherently entitled to the second.
Most failures people worry about don’t come from models lacking structure. They come from models (or users) quietly treating cognition as decision:
coherence as truth,
explanation as diagnosis,
simulation as fact,
“this sounds right” as “this is settled.”
That’s why “world model” language is dangerous if it’s taken to imply authority. It subtly licenses conclusions the system isn’t grounded or authorized to make—about reality, about causation, or about a user’s intent or error.
A cleaner way to state the situation is:
> These systems build rich internal representations that are often world-relevant, but they do not have autonomous authority to turn those representations into claims without external grounding or explicit human commitment.
Under that framing:
The “word model” camp is right to worry about overconfidence and false grounding.
The “world model” camp is right that the internal structure is far richer than token statistics.
They’re arguing about different failure modes, but using the same overloaded word.
Once you separate cognition from decision, the debate mostly dissolves. The important question stops being “does it understand the world?” and becomes “when, and under what conditions, should its outputs be treated as authoritative?”
That’s where the real safety and reliability issues actually live.