Until we have world models, that is exactly what they are. They literally only understand text, and what text is likely given previous text. They are very good at this, because we've given them a metric ton of training data. Everything is "what does a response to this look like?"
This limitation is exactly why "reasoning models" work so well: if the "thinking" step is not persisted to text, it does not exist, and the LLM cannot act on it.
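Concretely, this is just a consequence of the sampling loop: the model can only condition on tokens that are actually in the context. A minimal sketch (the `generate` function is a made-up stand-in for a completion call, not any real API):

    # Sketch: the only "memory" across steps is the text we append, so the
    # reasoning has to be emitted as tokens before the answer can use it.
    def answer_with_reasoning(generate, question):
        context = "Question: " + question + "\nThink step by step:\n"
        thinking = generate(context, stop="Answer:")   # reasoning persisted as text
        context += thinking + "\nAnswer:"              # only now can the model condition on it
        return generate(context)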
Text comes in, text goes out, but there's a lot of complexity in the middle. It's not a "world model", but there's definitely modeling of the world going on inside.
There is zero modeling of the world going on inside, for the very simple reason that it has never seen the world. It's only been given text, which means it has no idea why that text was written. This is the fundamental limitation of all LLMs: they are only trained on text that humans have written after processing the world. You can't "uncompress" the text to recover the world state that led to it being written.
I don't see why only understanding text is completely associated with 'stochastic-parrot'-ness. There are blind-deaf people around (mostly interacting through reading braille, I think) who are definitely not stochastic parrots.
Moreover, they do have a little bit of Reinforcement Learning on top of reproducing their training corpus.
I believe there has to be some form of thinking, even if very primitive (and something like creativity, even), just to do the usual (non-RL, supervised) LLM job of text continuation.
The most problematic thing is humans tend to abhor middle grounds. Either it thinks or it doesn't. Either it's an unthinking dead machine, a stochastic parrot, or human-like AGI. The reality is probably in between (maybe still more on the stochastic-parrot side, definitely with some genuine intelligence, but with some unknown, probably small, amount of sentience as of yet). Reminder that sentience, and not intelligence, is what should give it rights.
Because blind-deaf people interact with the world directly. LLMs do not, cannot, and have never seen the actual world. A better analogy would be a blind-deaf person born in Plato's Cave, reading text all day. They have no idea why these things were written, or what they actually represent.
Nobody tries to jail the automobile when it hits a pedestrian while on cruise control. The driver is responsible for knowing the limits of the tool and adjusting accordingly.
I have no idea why Google is wasting their time with this. Trying to hallucinate an entire world is a dead-end. There will never be enough predictability in the output for it to be cohesive in any meaningful way, by design. Why are they not training models to help write games instead? You wouldn't have to worry about permanence and consistency at all, since they would be enforced by the code, like all games today.
Look at how much prompting it takes to vibe code a prototype. And they want us to think we'll be able to prompt a whole world?
This was a common argument against LLMs, that the space of possible next tokens is so vast that eventually a long enough sequence will necessarily decay into nonsense, or at least that compounding error will have the same effect.
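For concreteness, the compounding-error argument is just this arithmetic (the per-token error rate is made up for illustration):

    # If each token goes "wrong" independently with probability eps, the chance
    # an N-token continuation stays fully on track is (1 - eps) ** N.
    eps = 0.01                      # assumed per-token error rate, purely illustrative
    for n in (10, 100, 1000):
        print(n, (1 - eps) ** n)    # ~0.90, ~0.37, ~0.00004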
Problem is, that's not what we've observed to happen as these models get better. In reality there is some metaphysical coarse-grained substrate of physics/semantics/whatever[1] which these models can apparently construct for themselves in pursuit of ~whatever~ goal they're after.
The initially stated position, and your position: "trying to hallucinate an entire world is a dead-end", is a sort of maximally-pessimistic 'the universe is maximally-irreducible' claim.
And going back a little further, it was thought that backpropagation would be impractical, and trying to train neural networks was a dead end. Then people tried it and it worked just fine.
> Problem is, that's not what we've observed to happen as these models get better
Eh? Context rot is extremely well known. The longer you let the context grow, the worse LLMs perform. Many coding agents will pre-emptively compact the context or force you to start a new session altogether because of this. For Genie to create a consistent world, it needs to maintain context of everything, forever. No matter how good it gets, there will always be a limit. This is not a problem if you use a game engine and code it up instead.
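The compaction trick is roughly the following. This is only a sketch, not any particular agent's implementation, and `count_tokens`/`summarize` are hypothetical stand-ins:

    # Pre-emptive context compaction: when the transcript nears the window limit,
    # older messages get replaced by a summary so recent turns stay verbatim.
    def maybe_compact(messages, count_tokens, summarize, limit=100_000, keep_recent=20):
        if sum(count_tokens(m) for m in messages) < limit:
            return messages
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        return [{"role": "system", "content": summarize(old)}] + recent

Note that the summary is lossy, which is exactly why a generated world that relies on it will forget what it decided earlier.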
We're talking about context here though. The first couple seconds of Genie are great, but over time it degrades. It will always degrade, because it's hallucinating a world and needs to keep track of too many things.
That has traditionally been the problem with these types of models, but Genie is supposed to maintain coherence up to 60 seconds.
I've tried using it a couple of times, but can't get in. It is either down or hopelessly underprovisioned by Google. Do you have any links to videos showing that the quality degrades after only a few seconds?
Edit: no, it just doesn't work in Firefox. It works incredibly well, at least in Chrome, and it does not lose coherence to any great extent. The controls are terrible, though.
Imo they explain pretty well what they are trying to achieve with SIMA and Genie in the Google Deepmind Podcast[1].
They see it as the way to get to AGI by letting AI agents learn for themselves in simulated worlds. Kind of like how they let AlphaGo train for Go in an enormous amount of simulated games.
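Roughly the loop they describe, sketched with a toy Q-learning agent; the `env` simulator and its API here are entirely hypothetical, not DeepMind's actual setup:

    import random
    from collections import defaultdict

    # Toy sketch of "learning in a simulated world": the agent improves purely
    # from rollouts in the simulator, with no real-world data involved.
    def train(env, episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
        q = defaultdict(float)                              # Q[(state, action)]
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = (random.choice(env.actions) if random.random() < eps
                     else max(env.actions, key=lambda x: q[(s, x)]))
                s2, r, done = env.step(a)                   # the simulator provides all feedback
                target = r if done else r + gamma * max(q[(s2, x)] for x in env.actions)
                q[(s, a)] += alpha * (target - q[(s, a)])
                s = s2
        return q

The pitch is that Genie plays the role of `env`, so agents can rack up experience without a hand-built game or the real world.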
That makes even less sense, because an AI agent cannot learn effectively from a hallucinated world without internal consistency guarantees. It's an even stronger case for leveraging standard game engines instead.
If that's the goal, the technology for how these agents "learn" would be the most interesting part, even more than the demos in the link.
LLMs can barely remember the coding style I keep asking them to stick to, despite numerous prompts and stuffing that guideline into whatever the newest flavour of product-specific markdown file is. They keep expanding the context window to work around that problem.
If they have something for long-term learning and growth that can help AI agents, they should be leveraging it for competitive advantage.
Take the positive spin: what if you could put in all the inputs and it can simulate real world scenarios you can walk through to benefit mankind, e.g. disaster scenarios, events, plane crashes, traffic patterns? There are a lot of useful applications for it. I don't like the framing at this time, but I also get where it's going.

The engineer in me is drawn to it, but the Muslim in me is very scared to hear anyone talk about creating worlds... But again, I have to separate my view from the reality that this could have very positive real-world benefits when you can simulate scenarios. So I could put in a 2-page or 10-page scenario that gets played out or simulated and lets me walk through it; not just predictive stuff, but also things that have already happened, so I can map crime scenes or anything.

In the end this performance art exists because they are a product company being benchmarked by Wall Street and they'll need customers for the technology, but at the same time they probably already have uses for it internally.
> What if you could put in all the inputs and it can simulate real world scenarios you can walk through to benefit mankind e.g disaster scenarios, events, plane crashes, traffic patterns.
This is only a useful premise if it can do any of those things accurately, as opposed to dreaming up something kinda plausible based on an amalgamation of every vaguely related YouTube video.
> What if you could put in all the inputs and it can simulate real world scenarios you can walk through to benefit mankind e.g disaster scenarios, events, plane crashes, traffic patterns.
What's the use? Current scientific models clearly showing natural disasters and how to prevent them are being ignored. Hell, ignoring scientific consensus is a fantastic political platform.
A hybrid approach could maybe work: have a more or less standard game engine for coherence, and use this kind of generative AI more or less as a short-term rendering and physics sim engine.
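Something like this loop is what I have in mind, with the engine as the single source of truth and the generative model only painting frames from that state (all the names here are hypothetical):

    # Hypothetical hybrid loop: the engine owns all state, the generative model
    # only renders pixels conditioned on it, so drift in the pixels never corrupts it.
    def run(engine, renderer, get_input, display):
        frame = None
        while True:
            action = get_input()                              # player input
            state = engine.step(action)                       # authoritative geometry/physics
            frame = renderer.render(state, prev_frame=frame)  # AI "skins" the frame
            display(frame)                                    # nothing flows from pixels back into state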
I've thought about this same idea but it probably gets very complicated.
Let's say you simulate a long museum hallway with some vases in it. Who holds what? The basic game engine has the geometry, but once the player pushes a vase and moves it, the AI needs to inform the engine that it did, and then to draw the next frame it has to read from the engine first, update the position in the video feed, then feed that back to the engine again.
What happens if the state diverges? Who wins? If the AI wins, then... why have the engine at all?
It is possible, but then who controls physics? The engine, or the AI? The AI could have a different understanding of the details of the vase. What happens if the vase has water inside? Who simulates that? What happens if the AI decides to break the vase? Who simulates the AI?
I don't doubt that some sort of scratchpad to keep track of stuff in game would be useful, but I suspect the researchers are expecting the AI to keep track of everything in its own "head" because that's the most flexible solution.
Then maybe the engine should be less about really simulating the 3D world and more about trying its best to preserve consistency: providing memory and saving context rather than truly simulating much beyond higher-level concerns (at which point we might wonder if it couldn't be directly part of the model somehow). But writing those lines, I realize there would probably still be many edge cases exactly like what you are describing...
Why is it a dead end? You don't meaningfully explain that. These models look like you can interact with them, and they seem to replicate physics models.
They don't though, they're hallucinated videos. They're feeding models tons and tons of 2D videos and hoping they figure out physics from them, instead of just using a game engine and having the LLM write something up that works 100% of the time.
On the flip side, the emergent properties that come from some of these wouldn’t be replicable by an engine. A moss covered rock realistically shedding moss as it rolls down a hill. Condensation aggregating into beads and rivulets on glass. An ant walking on a pitcher plant and being able to walk inside it and see bugs drowned from its previous meal. You’re missing the forest for the trees.
And then the rivulets disappear or change completely because you looked away. The reason this is a dead end is because computationally, there is absolutely no way for the model to keep track of everything that it decided. Everything is kept "in its head" rather than persisted. So what you get is a dream world, useless for training world models. It's great for prototyping, terrible for anything more durable.
As a kid in the early 1980s, I spent a lot of time experimenting with computers by playing basic games and drawing with crude applications. And it was fun. I would have loved to have something like Google's Genie to play with. Even if it never evolved, the product in the demos looks good enough for people to get value from.
> Why are they not training models to help write games instead?
Genie isn't about making games... Granted, for some reason they don't put this at the top. Classic Google, not communicating well...
> It simulates physics and interactions for dynamic worlds, while its breakthrough consistency enables the simulation of any real-world scenario — from robotics and modelling animation and fiction, to exploring locations and historical settings.
The key part is simulation. That's what they are building this for. Ignore everything else.
Same with Nvidia's Earth 2 and Cosmos (and a bit like Isaac). Games or VR environments are not the primary drive; the primary drive is training robots (including non-humanoids, such as Waymo) and just getting the data. It's exactly because of this that perfect physics (or, let's be honest, even realistic physics[0,1]) isn't actually required. Getting 50% of the way there in simulation really does cut down the costs of development, even if we recognize that the cost steepens as we approach "there". I really wish they didn't call them "world models", or more specifically didn't shove the word "physics" in there, but hey, is it really marketing if they don't claim the golden goose can not only lay actual gold eggs but also diamonds, and that its honks cure cancer?
[0] Looking right does not mean it is right. Maybe it'll match your intuition or your undergrad general-physics-with-calculus classes, but talk to a real physicist if you doubt me here. Even one with just an undergrad degree will tell you this physics is unrealistic, and anyone worth their salt will tell you how unintuitive physics ends up being as you get more realistic, even well before approaching quantum. Go talk to the HPC folks and ask them why they need supercomputers... Sorry, physics can't be done from observation alone.
[1] Seriously, I mean look at their demo page. It really is impressive, don't get me wrong, but I can't find a single video that doesn't have major physics problems. That "A high-altitude open world featuring deformable snow terrain." looks like it is simulating Legolas[2], not a real person. The work is impressive, but it isn't anywhere near realistic https://deepmind.google/models/genie/
But it's not simulating, is it? It's hallucinating videos with an input channel to guide what the video looks like. Why do that instead of just picking Unreal, Unity, etc and having it actually simulated for a fraction of the effort?
Depends on your definition of simulation but yeah, I think you understand.
I think it really comes down to dev time and adaptability. But honestly I'm fairly with you. I don't think this is a great route. I have a lot of experience in synthetic data generation, and nothing beats high quality data. I do think we should develop world models, but I wouldn't call something a world model unless it actually models a physics. And I mean "a physics", not "what people think of as 'physics'" (i.e. the real world). I mean having a counterfactual representation of an environment. Our physics equations are an extremely compressed representation of our reality. You can't generate these representations through observation alone, and that is the naive part of the usual way world models are developed. But we'd need to go into metaphysics, and that's a long conversation not well suited for HN.
These simulations are helping, but they have a clear limit to their utility. I think too many people believe that if you just feed the models enough data they'll learn. Hyperscaling is a misunderstanding of the Bitter Lesson that slows development despite showing some progress.
Both can be true. You're tapping into every line of code publicly available, and your day-to-day really isn't that unique. They're really good at this kind of work.
You didn't write sorting code or assembly code because you were going to need to write it on the job. It gave you a grounding for how data structures and computers work at a fundamental level. That intuition is what makes picking up Minecraft hack mods much easier.
That's the koolaid, but seriously I don't really believe it anymore.
I only had to do this leg work during university to prove that I can be allowed to try and write code for a living.
The grounding, as you call it, is not required for that at all, since I'm a dozen levels of abstraction removed from it.
It might be useful if I were a researcher or worked on optimizing complex cutting-edge stuff, but 99% of what I do is CRUD apps and REST APIs. That stuff can safely be done by anyone, no need for a degree.
Tbf I'm from Germany, so in other places they might allow you to do this job without a degree.
But nobody goes to college specifically to train for CRUD apps. The point is to give you broad training so that you can do CRUD apps and other stuff too. It is a very bad idea to give extremely specific training at scale, because then you get a workforce that has difficulty adapting to change. It's like trying to manage a planned economy: there is no point in trying to predict exactly what jobs you will get, so let's make sure you can handle whatever's thrown at you.
I do not understand why this is a big deal. There is no world in which ads are embedded in LLM answers, because you'd need another LLM to determine whether the "placement" was correct and included all the information that the advertiser wanted (and it still won't work 100%). They are putting ads on the side, like they've always done, leveraging all the tech that already exists to do this. This is pretty much a no-brainer for OpenAI and any AI company.
Why is this a problem? You don't need an LLM; you need a "model for detecting whether and where the given context appears". We're so used to LLMs now that we forget these NLP problems have been worked on for a long time; they don't require a huge computational beast, and it takes a few ms to run (on just the response, while it's being streamed).
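As a sketch of how cheap that can be on a streamed response (the trigger list and names are made up, and a real system would use a trained matcher rather than regexes):

    import re

    # Deliberately simplistic detector: scan the streamed answer for contexts an
    # advertiser might care about; no second LLM required.
    AD_TRIGGERS = {
        "running_shoes": re.compile(r"\b(marathon|running shoes?|jogging)\b", re.I),
        "travel": re.compile(r"\b(flights?|hotels?|itinerary)\b", re.I),
    }

    def matching_ad_contexts(partial_response):
        return [topic for topic, pattern in AD_TRIGGERS.items()
                if pattern.search(partial_response)]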
That's a no-go for advertisers. They need to control their own brand and their image; they're not going to leave it up to chance like "ok, just mention us or something and then we'll pay you." They pay for very specific placements, and they need to provide the content.
> But it seems that the pinnacle of human intelligence: the greatest, smartest, brightest minds have all come together to... build us another ad engine. What happened to superintelligence and AGI?
While we all knew it was inevitable, I think this quote from the article sums up the feeling nicely.
I don't get that part either. That's like saying "the brightest minds have all come together to... build us another thing that requires money." OpenAI is trying to make more money than they spend running ChatGPT. They are getting that money from their users, and soon advertisers too. They still need their users to like ChatGPT.
Literally the first thing you will learn in journalism school is that there is no such thing as "objective neutrality". Even deciding what story to cover includes bias.
Most of that stuff isn't necessary just to replace Plex; the OP's saying that Jellyfin started them on a journey they're presumably enjoying, not that they needed everything there to replace it.
I think you're right that the bar is still too high for most folks, although I will note that I think it's dramatically lower than it used to be. A lot of the tools are all-around way easier to deal with, Tailscale makes a lot of "personal cloud" use-cases much more feasible, and then coding agents (I'm using Claude Code) dramatically reduce the labor cost of getting this stuff all working and fixing it when something goes wrong.
Yep you nailed it. That’s all I was saying. None of those things were critical to Jellyfin working.
But I will say, for the size of my music library, Jellyfin was not quite as good as Plex, and that was the impetus behind my switch to Navidrome for audio.
And Navidrome isn't the best for audiobooks, so I'm in the process of testing good audiobook hosting platforms.
So the reply wasn’t wrong either. Plex is just easier for a lot of folks, and that is why I don’t have any ill will towards their changes. They just aren’t for me.
The only two of those you actually need for a Plex-like setup are Jellyfin and Tailscale; both are trivial to set up and will run on basically any hardware you can imagine wanting to use for this.
It is hard to beat the polish that Plex has. I set up Jellyfin to try it out and couldn't find a client that was smooth or had the polish of the Plex apps. The Apple TV app was close, but then I went down the rabbit hole of codec support. I wanted to like Jellyfin, but without a nice-looking front end it was a non-starter for me. The good news is you can have them side by side, and if a time comes when it reaches parity with Plex, I will be happy to change over.
When I looked for a Plex alternative I settled on Emby. It still has some "premium" features but they're all just QOL, not necessary things. The base app is great and even has handy little features Plex doesn't, and so far, it runs on all the same devices with a much snappier UX on the client side.
Yes, my biggest current gripe is that Infuse is a much better client than the first-party app. Otherwise, I'm very happy with it even if it lacks some of the polish of Plex.
Yeah, that's exactly why I'm on it. The frontend is fine, maybe a wash compared to Swiftfin last time I tried it out. But for my library, I had frequent issues with codec support on the native client versus zero on Infuse.