> 860M UNet and CLIP ViT-L/14 (540M)
Checkpoint size:
- 4.27 GB
- 7.7 GB (full EMA)
Running on a TPU-v5e:
- Peak compute per chip (bf16): 197 TFLOPs
- Peak compute per chip (int8): 393 TFLOPs
- HBM2 capacity and bandwidth: 16 GB, 819 GBps
- Interchip interconnect bandwidth: 1,600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement: it seems the model didn't even memorize the game, despite having the capacity to do so hundreds of times over. So there's definitely lots of room for optimization methods, though who knows how such methods would affect existing techniques, since the goal here is to memorize.
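The "hundreds of times over" claim checks out on a back-of-envelope basis. All sizes below are approximate assumptions (bf16 parameter storage, a ~12 MB registered DOOM.WAD), not figures from the paper:

```python
# Back-of-envelope capacity comparison; sizes are rough assumptions.
unet_params = 860e6          # SD 1.4 UNet
clip_params = 540e6          # CLIP ViT-L/14 text encoder
bytes_per_param = 2          # bf16 storage

model_bytes = (unet_params + clip_params) * bytes_per_param
doom_wad_bytes = 12e6        # registered DOOM.WAD is roughly 12 MB

ratio = model_bytes / doom_wad_bytes
print(f"model ~{model_bytes / 1e9:.1f} GB, ~{ratio:.0f}x the WAD")
```

So even ignoring the VAE and EMA weights, the trainable parameters alone could store the game data a couple of hundred times over.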
What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (and how much prior knowledge they'd get credit for, considering the pretrained models and the ViZDoom environment. Was the Doom source code in T5's training data? And which ViT checkpoint was used? I can't keep track of Google's ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
Those are valid points, but irrelevant in the context of this research.
Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all of that can either be done at the margin of this discovery, OR will naturally improve over time, OR will become less important as a blocker.
The fact that you can model a sequence of frames with such contextual awareness, without explicitly having to encode it, is the real breakthrough here, both from a pure gaming standpoint and for simulation in general.
I suppose it also doesn't really matter what kind of resources the game originally requires. The diffusion model isn't going to require twice as much memory just because the game does. Presumably you wouldn't even need to be able to render the original game in real time: I would imagine the basic technique would work even if you used a state-of-the-art, Hollywood-quality offline renderer to produce each input frame, and that the performance of the diffusion model would be similar?
Well, the majority of ML systems are compression machines (entropy minimizers), so ideally you'd want to see if you can learn the assets and game mechanics through play alone (which is what this paper shows). Better still would be to do so more efficiently than the devs themselves, finding a better compression. Certainly the game is not perfectly optimized. But still, this is a step in that direction. I mean, no one has accomplished this before, so even with a model of far higher capacity it's progress. (I think people are interpreting my comment as dismissive. I'm critiquing, but the key point I was making is that there are likely better architectures, training methods, and all sorts of things still to research. Personally I'm glad there's more to research. That's the fun part.)
>you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.
OR one can hope it will be thrown on the heap of nonviable tech with the rest of the spam waste.
1) the model has enough memory to store not only all the game assets and the engine, but even hundreds of "plays".
2) me mentioning that there's still a lot of room to make these things better (it seems you think so too, so maybe not this one?)
3) an interesting point I raised to compare the current state of things (I mean, I'll give you this one, but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯\_(ツ)_/¯)
4) the point that you can rip a game
I'm really not sure which of these you're contesting, because I said several things.
> it lacks basic things like pre-computing, storing, etc.
It does? Last I checked, neural nets store information. I guess I need to return my PhD, because there's a UNet in SD 1.4 and it contains a decoder.
1) Yes, you are correct. The point I was making is that, in the context of the discovery/research, that's outside the scope and 'easier' to do, as it has been done in other verticals (e.g. e2e self-driving).
2) yep, aligned here
3) I'm not fully following here, but agreed this is not NeurIPS, and no Schmidhuber-style bickering.
4) The network does store information; it just doesn't store gameplay information. That could be forced, but as per point 1 it is, and I think rightly so, beyond the scope of this research.
1) I'm not sure this is outside the scope. It's also not something I'd use to reject a paper were I to review this at a conference. I mean, you've got to start somewhere, and unlike reviewer 2 I don't think any criticism is rejection criteria; that would be silly, given the lack of globally optimal solutions. But I'm also unconvinced this is proven by self-driving vehicles, though I'm not an RL expert.
3) It's always hard to evaluate. I was thinking about ripping the game, and so a reasonable metric is a comparison with a human's ability to perform the task. Of course, I'm A LOT faster than my dishwasher at cleaning dishes, but I'm not occupied while it's running, so it still has high utility. (Someone tell reviewer 2, lol)
4) Why should we believe that it doesn't store gameplay? The model was fed "user" inputs and frames, so it has this information, and this information appears useful for learning the task.
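For what it's worth, the mechanics are easy to sketch: as I understand the paper, the learned action embeddings replace SD 1.4's CLIP text conditioning, and past frame latents are concatenated channel-wise with the noisy latent, so past key presses and past frames are literally part of the model input. A rough numpy sketch (shapes, sizes, and the random table are my assumptions, not the paper's code):

```python
import numpy as np

# Rough sketch (hypothetical shapes/interface) of action + frame conditioning.
rng = np.random.default_rng(0)
n_actions, embed_dim, context_len = 8, 768, 4

# Learned action embeddings stand in for CLIP text embeddings;
# here the table is just random for illustration.
action_table = rng.normal(size=(n_actions, embed_dim))

past_actions = np.array([0, 3, 3, 5])        # last few key presses
action_embeds = action_table[past_actions]   # (4, 768) -> cross-attention

# Past frames (as VAE latents) are concatenated channel-wise with the
# noisy latent being denoised.
past_latents = np.zeros((context_len, 64, 64, 4))
noisy_latent = np.zeros((64, 64, 4))
history = np.concatenate(list(past_latents), axis=-1)         # (64, 64, 16)
unet_input = np.concatenate([noisy_latent, history], axis=-1)  # (64, 64, 20)
print(action_embeds.shape, unet_input.shape)
```

Whatever the exact wiring, the actions are in the input, so "it doesn't store gameplay" needs an argument, not an assertion.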
>What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute
That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
You're jumping ahead there, and I'm not convinced you could ever do this (unless your model is already a great physics engine). The paper itself feeds the controls into the network. But a flight sim would be harder, because you'd also need to feed in air conditions. I just don't see how you could do this from video alone, let alone just video from the cockpit. Humans could not do this; there's just not enough information.
There's an enormous amount of information if your GoPro placement includes all the flight instruments. Humans can and do predict aircraft state at t+1 by parsing a visual field that includes the instruments; that is what the instruments are for.
Plus, presumably, either training it on pilot inputs (and being able to map those to joystick inputs and mouse clicks), or having the user sit in an identical fake cockpit with a camera to pick up their movements.
And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.
The possibilities seem to extend far beyond gaming (given enough computational resources).
You can feed it videos of the usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by sensors, though the action space would be much larger).
Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.
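As a sketch of what such a gym-style wrapper could look like: the `WorldModel` class and its `predict` interface below are entirely hypothetical stand-ins for a learned frame predictor, not anything from the paper, and the reward/termination logic is left as a stub because it would need its own learned head or heuristic.

```python
import numpy as np

class WorldModel:
    """Hypothetical stand-in for a diffusion model that predicts the
    next frame given a short history of frames and actions."""
    def predict(self, frames, actions):
        # A real model would denoise conditioned on (frames, actions);
        # here we just return a blank frame of the right shape.
        return np.zeros((240, 320, 3), dtype=np.uint8)

class NeuralEnv:
    """Minimal gym-style loop over the learned 'engine'."""
    def __init__(self, model, context_len=4):
        self.model = model
        self.context_len = context_len
        self.frames, self.actions = [], []

    def reset(self):
        first = np.zeros((240, 320, 3), dtype=np.uint8)
        self.frames, self.actions = [first], []
        return first

    def step(self, action):
        self.actions.append(action)
        nxt = self.model.predict(self.frames[-self.context_len:],
                                 self.actions[-self.context_len:])
        self.frames.append(nxt)
        # Reward and termination would come from a separate learned head.
        return nxt, 0.0, False, {}
```

An RL agent would then interact with `NeuralEnv` exactly as it would with ViZDoom, never touching the original engine.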
Wouldn't it make more sense to train on Microsoft Flight Simulator the same way they did DOOM? Though I'm not sure what the point would be if the game already exists.
- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
- https://cloud.google.com/tpu/docs/v5e
- https://github.com/Farama-Foundation/ViZDoom
- https://zdoom.org/index