
I’m not sure how that would help vs just training the model with the conditionings described in the paper.

I’m not very familiar with Gaussian splat models, but aren’t they just a way of constructing images from multiple superimposed, parameterized Gaussian distributions, sort of like a Fourier series builds waveforms from sine and cosine waves?
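Roughly, yes — the "sum of Gaussians" picture looks something like this toy sketch. (Hedge: real 3D Gaussian splatting projects 3D Gaussians, depth-sorts them, and alpha-composites; here I just additively blend 2D Gaussians to show the analogy to summing basis functions.)

```python
import numpy as np

def render_gaussians(means, covs, colors, opacities, h, w):
    """Toy renderer: composite an image as a weighted sum of 2D
    Gaussians, loosely analogous to summing sinusoids in a Fourier
    series. Each Gaussian has a mean (position), covariance (shape),
    color, and opacity."""
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).astype(float)  # (h, w, 2)
    img = np.zeros((h, w, 3))
    for mu, cov, c, a in zip(means, covs, colors, opacities):
        d = pts - np.asarray(mu, dtype=float)
        inv = np.linalg.inv(cov)
        # Mahalanobis distance gives an anisotropic Gaussian falloff
        m = np.einsum('hwi,ij,hwj->hw', d, inv, d)
        weight = a * np.exp(-0.5 * m)
        img += weight[..., None] * np.asarray(c, dtype=float)
    return np.clip(img, 0.0, 1.0)

# Two blobs: a tight red one and a broad blue one
img = render_gaussians(
    means=[(16, 16), (40, 30)],
    covs=[np.eye(2) * 30, np.eye(2) * 80],
    colors=[(1, 0, 0), (0, 0, 1)],
    opacities=[0.9, 0.6],
    h=64, w=64,
)
```

Fitting a scene then means optimizing those means/covariances/colors so the rendered image matches the training views.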

I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.




I'm not certain where it would fit in, but my thinking is this.

There's been a bunch of work on making splats efficient and good at representing geometry. Reading more, perhaps NeRFs would be a better fit, since they're an actual neural network.

My thinking is that if you trained a NeRF ahead of time to represent the geometry and layout of the levels, and plugged it into the diffusion model (as part of computing the latents, and then also on the output side so it can be used to improve the rendering), then the diffusion model could focus on learning how actions manipulate the world without having to learn the geometry representation.
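One hypothetical way to wire that up (my own sketch, not anything from the paper): have the frozen, pre-trained NeRF render a geometry signal like depth for the current camera pose, then concatenate it to the diffusion latents as extra condition channels, so the denoiser gets geometry for free.

```python
import numpy as np

def condition_latents(latents, nerf_depth):
    """Hypothetical conditioning: normalize a depth map rendered by a
    pre-trained NeRF and stack it onto the diffusion latents as an
    extra channel, so the denoiser need not re-learn level geometry.
    latents: (C, H, W), nerf_depth: (H, W) -> returns (C+1, H, W)."""
    depth = (nerf_depth - nerf_depth.min()) / (np.ptp(nerf_depth) + 1e-8)
    return np.concatenate([latents, depth[None]], axis=0)

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 16, 16))     # stand-in diffusion latents
depth = rng.random((16, 16))               # stand-in NeRF depth render
cond = condition_latents(latents, depth)   # shape (5, 16, 16)
```

The denoiser's input convolution would just need one extra channel; everything downstream stays the same.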


I don’t know if that would really help; I have a hard time imagining exactly what that model would be doing in practice.

To be honest none of the stuff in the paper is very practical, you almost certainly do not want a diffusion model trying to be an entire game under any circumstances.

What you might want to do is use a diffusion model to transform a low-poly, low-fidelity game world into something photorealistic. The geometry, player movement, physics etc. would all make sense, and then the model paints over it something that looks like reality, based on some primitive texture cues in the low-fidelity render.
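That "paint over a low-fi render" idea is basically the standard img2img trick: partially noise the render, then denoise, so the output keeps the input's structure but gains detail. A minimal sketch, with a stand-in for a trained denoiser (the `denoise_step` callable here is hypothetical, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def stylize(low_fi_frame, denoise_step, strength=0.6, steps=10):
    """img2img-style sketch: add noise proportional to `strength`,
    then run the tail of the denoising schedule. Low strength keeps
    more of the input's geometry; high strength repaints more.
    `denoise_step(x, t)` stands in for a trained diffusion denoiser."""
    x = low_fi_frame + rng.normal(0.0, strength, low_fi_frame.shape)
    start = int(steps * strength)
    for t in range(start, 0, -1):
        x = denoise_step(x, t / steps)
    return x

# Toy denoiser stand-in: just pulls the sample back toward the input.
frame = rng.random((8, 8, 3))  # the "low-poly render"
out = stylize(frame, lambda x, t: x - 0.5 * t * (x - frame))
```

The real version would condition the denoiser on the render (and on temporal context, for video) rather than merely starting from it.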

I’d bet money that something like that will happen and it is the future of games and video.


Yeah, I realize this will never be useful for much in practice (although maybe as some kind of client-side prediction for cloud gaming? But if you could run this in real time, you could likely just run the actual game in real time as well, unless there's some massive world running on the server that's too large to stream the geometry for effectively). I was mostly just trying to think of a way to avoid the issues someone mentioned with fake-looking frames, or the model forgetting what the level looks like when you turn around.

Not exactly that, but Nvidia already does something like this; they call it DLSS. It uses previous frames and motion vectors to predict the next frame using machine learning.
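The core temporal trick there is reprojecting the previous frame along per-pixel motion vectors before the network cleans it up. A nearest-neighbor sketch of that warp step (simplified; real implementations use subpixel sampling and handle disocclusions):

```python
import numpy as np

def warp_frame(prev, motion):
    """Reproject `prev` along per-pixel motion vectors to predict the
    next frame. motion[y, x] = (dy, dx) says where content at (y, x)
    moved to, so we fetch each output pixel from (y - dy, x - dx).
    Nearest-neighbor, edge-clamped."""
    h, w = prev.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(ys - motion[..., 0], 0, h - 1).astype(int)
    sx = np.clip(xs - motion[..., 1], 0, w - 1).astype(int)
    return prev[sy, sx]

prev = np.zeros((4, 4))
prev[1, 1] = 1.0                 # a single bright pixel
motion = np.zeros((4, 4, 2))
motion[..., 1] = 1               # everything moved one pixel right
pred = warp_frame(prev, motion)  # bright pixel now at (1, 2)
```

Whatever the warp can't explain (newly revealed geometry, lighting changes) is what the learned part has to hallucinate — which is where the "fake-looking frames" risk comes back in.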




