
It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.
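Roughly, I imagine the generation loop looks something like this (a hand-wavy sketch; the function names, history length, and step count are all my assumptions, not the paper's actual code):

  # Hand-wavy autoregressive loop: each new frame is denoised from
  # noise, conditioned on a rolling window of past frames and actions.
  from collections import deque

  past_frames = deque(maxlen=64)     # recent rendered frames
  past_actions = deque(maxlen=64)    # recent key/mouse inputs

  frame = model.decode(sample_noise())            # bootstrap the first frame
  while True:
      action = read_player_input()                # arrives in real time
      past_frames.append(frame)
      past_actions.append(action)
      frame = model.denoise(
          sample_noise(),                         # fresh latent noise
          frames=list(past_frames),
          actions=list(past_actions),
          steps=4,                                # few steps to stay within budget
      )
      display(frame)                              # 20 fps -> ~50 ms per frame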

Abstractly, it's like the model is dreaming of a game it has played a lot, and real-time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.




It makes good sense for humans to have this ability. If we flip the argument and see the next frame as a hypothesis for the expected outcome of the current frame, then comparing this "hypothesis" with what is actually sensed means the brain only has to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...


Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally, I have pretty severe aphantasia (I can conjure millisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone, but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn or built. However, I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g. I can't get lost in a new city). So there is clearly a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.


I fabulate about this in another comment below:

> Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

(I obviously don't know what I'm talking about, just a fellow aphant)


Obviously we're all introspecting here, but my guess is that there's some kind of cross-talk in aphantasic brains between the conscious narrating semantic brain and the visual module, such that default-mode visualisation is impaired, and it's specifically the loss of reflexive consciousness (as in dreams) that allows visuals to emerge. Not sure if this is related, but I have pretty severe chronic insomnia, and I often wonder if this in part relates to the inability to drift off into imagery.


Yeah. In my head it's like I'm manipulating SVG paths instead of raw pixels


Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.


What's the aphantasia link? I've got aphantasia. I'm convinced, though, that the bit of my brain that should be making images is instead used for letting me 'see' how things are connected together very easily in my head. Also, I still love games like Pictionary and can somehow draw things onto paper that I don't really know what they look like in my head. It's often a surprise when pen meets paper.


I agree; it is my own experience as well. Craig Venter, in one of his books, also credits this way of representing knowledge as abstractions as his strength in inventing new concepts.

The link may be that we actually see differences between "frames", rather than the frames directly. That in itself would imply that a form of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.

Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

I'm nowhere near qualified to speak on this with certainty, but it seems plausible to me.


"As Richard Dawkins theorized" would be more accurate and less LLM-like :)


We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast, "Counterintuitive Ideas About How the Brain Works" [2], where she explains, among other ideas, how constant prediction is a more efficient way of running a brain than reaction. I never get tired of listening to her; she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s


Interesting talk about the brain, but what she says about free will is not a very good argument. Basically it is the sort of argument the ancient Greeks made, which brings the discussion to a point where you can go in either direction.


> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

Yup, see https://en.wikipedia.org/wiki/Predictive_coding


Umm, that’s a theory.


So are gravity and friction. I don't know how well tested or accepted it is, but being "just a theory" doesn't tell you much about how true it is without more info.


> It's insane that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.

Something doesn't add up though, in my opinion. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are roughly two orders of magnitude faster, which would indicate that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.


There's been a lot of work on optimising SD inference speed - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.
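Back-of-envelope, with round numbers I'm assuming rather than taking from the paper:

  budget_ms = 1000 / 20                        # 20 fps -> 50 ms per frame
  step_speedup = 50 / 4                        # ~50 denoising steps cut to a handful
  pixel_speedup = (512 * 512) / (320 * 240)    # ~3.4x fewer pixels per pass
  print(budget_ms, step_speedup * pixel_speedup)   # 50.0 ms budget, ~43x cheaper

That alone covers most of the apparent two-orders-of-magnitude gap; a dedicated TPU and a distilled/turbo-style sampler could plausibly make up the rest.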


> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next-something predictors.

It’s just the stochastic parrot argument again.


Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/


Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional and some simply possible future scenarios, all while doing day-to-day stuff.


Penrose (Nobel Prize in Physics) postulates that quantum effects in the brain may allow a certain amount of time travel and backpropagation to accomplish this.


You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence


Did they take in the entire history as context?


An image is 2D; video is 3D. The mathematical extension is obvious. In this case, low-resolution 2D (pixels), and the third dimension is just the frame rate (discrete time steps). So rather simple.


This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.


It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.


I think you're mistaken. The abstract says it's interactive, "We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction"

Further - "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." specifically "and actions"

User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.


No, I am not. The interaction is part of the training, and is used during inference, but it is not included during the process of generation.


Okay, I think you're right. My mistake. I read through the paper more closely and I found the abstract to be a bit misleading compared to the contents. Sorry.


Don't worry. The paper is not very well written.


Academic authors are consistently better at editing away unclear and ambiguous statements that make their work seem less impressive than at editing away ones that make it seem more impressive. Maybe it's just a coincidence, lol.


It's interactive, but can it go beyond what it learned from the videos? As in, can the camera break free and roam around the map from different angles? I don't think it will be able to do that at all. There are still a few hallucinations in this rendering; it doesn't look like it understands 3D.


You might be surprised. Generating views from new angles based on a single image is not novel, and if anything, this model has more than a single frame as input. I'd wager that it's quite able to extrapolate DOOM-like corridors and rooms even if it hasn't seen the exact place during training. And sure, it's imperfect, but on the other hand it works in real time on a single TPU.


Then why do monsters become blurry, smudgy messes when shot? That looks like a video-compression-style artifact of a neural network attempting to replicate a low-structure image (the source material contains guts exploding, a very unstructured visual).


Uh, maybe because monster death animations make up a small part of the training material (i.e. gameplay), so the model has not learned to reproduce them very well?

There cannot be "video compression artifacts" because it hasn’t even seen any compressed video during training, as far as I can see.

Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.


In a sense, poorly reproducing rare content is a form of compression artifact. I.e., since this content occurs rarely in the training set, it will have less impact on the gradients and thus less impact on the final form of the model. Roughly speaking, the model is allocating fewer bits to this content, by storing less information about it in its parameters, compared to content it sees more often during training. I think this isn't too different from certain aspects of images, videos, music, etc. being distorted in different ways based on how a particular codec allocates its available bits.
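As a toy illustration of that intuition (made-up numbers, nothing from the paper):

  death_anim_fraction = 0.01       # assume gib/death frames are ~1% of training data
  total_frames_seen = 10_000_000   # assumed total frames seen during training
  # With a uniform per-frame loss, this content drives only ~1% of the
  # gradient signal, so the model spends little capacity reproducing it.
  print(death_anim_fraction * total_frames_seen)   # ~100,000 frames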


I simply cannot take seriously anyone who claims that monster death animations are a minor part of Doom. It's literally a game about slaying demons. Gameplay consists almost entirely of explosions and gore; killing monsters IS THE GAME. If you can't even get that right, then what nonsense are we even looking at?


Maybe it's so advanced, it knows the players' next moves, so it is a video!


I guess you are being sarcastic, except this is precisely what it is doing. And it's not hard: player movement is low information and probably not the hardest part of the model.


?

I highly suggest you briefly read the paper before commenting on the topic. The whole point is that it's not just generating a video.


I did. It is generating a video, using latent information on player actions during the process (which it also predicts). It is not interactive.


Uff, I guess you're right. Mea culpa. I misread their diagram as representing inference when it was about training instead. The latter is conditioned on actions, but… how do they generate the actual output frames then? What's the input? Is it just image-to-image based on the previous frame? The paper doesn't seem to explain the inference part at all well :(


It should be possible to generate an initial image from Gaussian noise, including the latent information on player position.


Video is also effectively higher resolution, because moving through the high-resolution world makes the pixels flip. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixel change.


Correct, for the sprites. However, the walls in Doom are texture mapped, and so have the same issue as videos. Interesting, though, because I assume the antialiasing is something approximate, given the extreme demands on CPUs of the era.



