
Paul, Weiss Waking Up With AI
World Models
In this week’s episode of “Paul, Weiss Waking Up With AI,” Katherine Forrest and Anna Gressel explore the emerging field of AI world models and the many applications of simulating a real-world environment, from interactive media experiences to robotic learning and even weather prediction.
Episode Transcript
Katherine Forrest: Welcome, everyone, to another episode of “Paul, Weiss Waking Up With AI.” I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel.
Katherine Forrest: And Anna, you know, we’re sitting here in New York only for a very short period of time because you’re about to go away again. But I just want to say that there has been more rain this spring — I’m not, don’t worry. I’m not going to go into another tadpole story. But there has been more rain this spring than you can shake a stick at. And we’re recording this on a day where it’s about to storm all over again.
Anna Gressel: I’m glad that you know that. I feel like I haven’t looked at the forecast today, if I’m being honest, but we’ll see.
Katherine Forrest: Are you in one of those apartments? Still in those Airbnbs from the fire in your house? So you probably have to, like, stick your neck out the window and crane to look up at the sky.
Anna Gressel: Actually, no, this one has like huge, beautiful windows, but it’s otherwise kind of a quirky apartment. So we’re all adjusting. It’s a period of adjustment.
Katherine Forrest: All right, today, while I am looking out at like this incredible cloudy day with yet more rain, I’m very excited because we’re going to be talking about world models. And the world model that I want to talk about actually has a blue sky.
Anna Gressel: Mm, great. Yeah, I mean, this is super exciting as a topic. I know we’ve wanted to cover it for a while. And world models are almost like a new modality for AI. You can kind of think about them that way. We have language models that understand and generate text, audio models that understand and generate speech, video models, of course, that can generate video clips and so on. And world models are meant to understand how the world works and generate data that represents that understanding, and how cool is that?
Katherine Forrest: Okay, so I want to pause on that phrase just for a second, world model, because it’s actually a really super interesting phrase that explains a lot in the words itself of what it is. A world model. So let’s sort of break that apart and bring it down to earth, literally. And one hint at what a world model could be capable of comes from some of the video generation models that we already have, like Sora or Veo. These models really are terrific at generating realistic short-form video clips. And to be able to do that successfully, the models have to be trained, and they have to learn certain things about the world, including, for instance, that what goes up comes down — things about what we would think of as the laws of gravity. And that things that may disappear from view, for instance, if you go around a corner, they’re not out of existence, they’re just out of view. So a world model is a model that is trained to be capable of accurately simulating a world environment. And the worlds that we’re talking about here could be real worlds that are photorealistic depictions of real places, or they could be virtual worlds that are either photorealistic or cartoonish.
Anna Gressel: Yeah, I remember one of the first Sora demos that came out actually had a video clip that was completely generated from scratch of a black cat walking across windowsills in what looked like Cape Town. And at one point, the cat slips and like rights itself and just keeps walking on the windowsills. And I remember being like, that was the most lifelike gesture to kind of lose balance and then gain balance. And it was amazing to me that that model could actually generate something like that, that had that kind of, you know, just eerie realness to it. It was really kind of quite a moment, I think, when those demo videos came out, and that was, that was a while ago now.
Katherine Forrest: Yeah, and actually let me just give you my little demo video, because when I first saw it…I think a lot of people in our audience may have seen the one where there was a clip, nighttime, where there was a woman walking across a really rainy street, and the rain is coming down, and there’s literally a kind of moistness on the street that was incredibly photorealistic. And the way that she walked, the way that the rain fell, all of that was taking advantage of some of these world model capabilities.
Anna Gressel: And these capabilities, I think it’s worth noting, really come, at least in part, through substantial progress that’s been made in the video game space. And two examples are DeepMind’s Genie 2 and Microsoft’s Muse. These are both generative models of video games. That means they’re able to generate realistic sequences of gameplay visuals complete with controller or keyboard input actions that cause each of the individual changes that are visible on the screen. And that’s like a really exciting demonstration that wasn’t really thought possible until these models came out.
Katherine Forrest: Well, I think you’re right, it wasn’t thought possible. You know, one of my books is called “Is Justice Real When ‘Reality’ Is Not?,” and it’s really about virtual worlds and VR and all kinds of interesting things. But it didn’t anticipate, I think, the speed at which some of this is happening. And beyond the coolness factor, why, you might ask, do we need AI models that can generate gameplay when I can just go, for instance, and play the game myself? Because some of these world models can actually generate gameplay, and simulating gameplay is really just the first step towards being able to simulate real-world environments, and the applications there are truly endless. And if we stick for a moment in the media domain, one use for world models is to create these interactive experiences, and actually, I talked about that in that book. Imagine a movie that’s being generated just for you, or in which you personally play one of the parts of the movie. You know, Sam Altman of OpenAI expressed this best, or at least he expressed it well, in a tweet from last year stating that, “movies are going to become video games and video games are going to become something unimaginably better.” And that, I think, is a real statement about some of the expected advances in these world models.
Anna Gressel: Yeah, and I think, in the commercial or industrial space, another application for world models — so, again, these are models that understand how the world works — is embodied agents, like robots. Being able to predict and simulate how certain actions impact their environment could help future robots learn to plan better and reason within highly complex real-world environments. And we’re actually seeing developments and research in this area today.
Katherine Forrest: Right, it’s actually worth mentioning that world models come up really frequently in the context of agents or agentic AI, generally. That is, for an AI agent to be able to reason well and to make decisions, it has to have some kind of understanding of its world, or the environment that it’s in, from world models, so that it can actually apply common sense. So narrow world models come up very frequently for agents, but what’s been missing and what is now really developing is kind of a foundation-scale, meaning very large scale, generalist world model that can give agentic AI truly a sense of how the world works. And that’s really what we’re talking about right here.
Anna Gressel: Yeah, and for that reason, world models can be incredibly useful for training agentic AI. It’s not just giving the agent a world model to use, but actually the world model can help train future agents. It’s almost like a practice terrain that the agents can learn to navigate, it can help them understand what the world is like and how the world works.
Katherine Forrest: Right, and so let’s talk for just a second about who’s working on world models and where our audience should be looking for advances on world models.
Anna Gressel: Yeah, well, we mentioned DeepMind and Microsoft a few moments ago. It’s also worth tracking the progress of the various video diffusion models we also mentioned, including Sora from OpenAI; Runway has research into world models, and DeepMind does as well. But a few additional startups have thrown their hat squarely into the ring of world models, and one is Fei-Fei Li’s World Labs.
Katherine Forrest: Right. Fei-Fei Li, in case any of our listeners don’t know, is a truly world-famous AI researcher. She created the ImageNet dataset, which ultimately led to the rebirth of deep learning in the 2010s. She is an extraordinarily well-known and well-respected AI researcher, and developer as well.
Anna Gressel: Yep, and World Labs has a very exciting interactive media demo that allows you to translate 2D photographs into 3D environments that can be explored.
Katherine Forrest: Right, and that 3D environment starts to then look like a world, a three-dimensional world, like the one that we live in. So world models, even as we think of them with Sora or with the beginning of 3D, that’s not likely to be the end of the development. There’s going to be a lot happening, because it’s one thing to understand gravity and object permanence, but to be truly accurate and sophisticated as a simulator of real-world physics, that takes additional advancements. That’s in fact the focus of a new team at DeepMind that was formed in January of this year, after the release of DeepMind’s Genie 2, with the goal of building a sophisticated simulator of real-world physics.
Anna Gressel: Yeah, you can think about this in an interesting way by thinking about current models as using language to describe physics or using images to describe physics. That’s kind of what LLMs and diffusion models do. But true, general world models may actually use some understanding of physical properties to describe physics. Referring back to that earlier episode we did on protein folding, we’re moving more towards a world in which we have models that truly can represent scientific physical properties of the world. And that just unlocks so many possibilities, but it also could be conceived of as a next step on the path towards AGI, or artificial general intelligence, which we’ve talked about before: AI that can perform the huge variety of intellectual tasks that humans can perform.
Katherine Forrest: Right. I mean there is an entire group of researchers, and not associated even with one another, that really believe that world models are the critical missing component in the quest for AGI, that in order to get to AGI, AI has to understand how the world really works in a true physical sense. Otherwise, it won’t be able to apply the common sense that babies have when they learn, for instance, that when you push your food to the end of your table, it’s going to fall off and fall on the floor and all your Cheerios go scattering everywhere else. So that physical awareness of the world will be that sort of missing link. And in fact, one of the job listings for that new team that I mentioned, that DeepMind team that was quoted in TechCrunch, stated that, “we believe scaling on video and multimodal data is on the critical path to artificial general intelligence.” So it’s a recognized issue in the field.
Anna Gressel: Yeah. If you think about everything that AI is capable of today, really adding this kind of world model capability could unlock endless possibilities for all different kinds of applications. And to get truly, truly far-flung and a bit sci-fi, imagine sending some sort of agent-manned mission to Mars, where it would have to adapt to a new physical world. To do that, you could actually imagine training and modeling that using some sort of virtual world with true physical properties. It’s kind of like a digital twin, almost, for the world — something where you can simulate what might happen before you actually go into a lot of preparation. And this is not so different from the kind of modeling that people do today, which uses a huge, huge amount of human cognition and skill and hypotheticals and testing. But if you have a world model undergirding that, or scaffolding that, that can be really, really helpful in advancing the state of the science.
Katherine Forrest: Right, and it can even be used for predictive purposes, for things even in an area that you might not think of as needing a world model, but like a weather system. That’s another problem that DeepMind has been working on, especially with the release of their WeatherNext technology. So it’s really interesting to stop and ask ourselves, what would it mean to have an agent that has full understanding of the most complex, the most granular details of our physical world? Really interesting stuff, Anna, but that’s all we’ve got time for today. I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel. Like and subscribe to the podcast if you’re enjoying it.