Podcasts
Paul, Weiss Waking Up With AI
Blueprints for Brains: The Architecture of Intelligence
In this episode, Katherine Forrest and Scott Caravello break down three generative AI architectures—transformers, JEPA, and diffusion models—exploring what sets each apart and how they overlap. They also discuss Manifold-Constrained Hyper-Connections, a recent innovation aimed at improving how transformer layers communicate during training.
For the sources referenced in this episode, please see the links below:
DeepSeek AI: mHC: Manifold-Constrained Hyper-Connections
Episode Transcript
Katherine Forrest: Hello everyone and welcome to today's episode of Paul, Weiss Waking Up With AI. I'm Katherine Forrest.
Scott Caravello: And I'm Scott Caravello. Katherine, how are you?
Katherine Forrest: Oh, wow. That was like… that was really nice. How am I? Yeah, all right.
Scott Caravello: Of course. I'm curious. I'm interested.
Katherine Forrest: So there are days when I feel like I've been shot out of a cannon. Have you ever had those days?
Scott Caravello: I think I might also be having one of those days if that's where this is going. So yeah.
Katherine Forrest: Yeah, that's where this is going. So, today is a day that I feel like I've been shot out of a cannon. And I don't think I have a trampoline on the other end. I'm really worried that I have an arc. My cannon arc is just being shot out of the cannon and maybe I land on a treetop or something if I'm lucky. So it's busy. It's busy.
Scott Caravello: I'm rooting for you. It's good. Busy is good though, you know. We're plugging along. We're doing the AI thing. It's great.
Katherine Forrest: We're doing the AI thing and boy, I am getting so many questions, by the way, on Mythos. That's not today's episode, but I was studying and going back and rereading the system card. Wait, I have to tell our listeners about one thing in the system card for Mythos before we get to our JEPA episode (this is the JEPA episode, everybody, don't turn that dial), but I have to read something because it's really interesting and we didn't cover it in our Mythos episode last time. It's on page 11 of the Anthropic system card dated April 7th, 2026 for the Claude Mythos preview. And it says, “we remain deeply uncertain about whether Claude has experiences or interests that matter morally and about how to investigate or address these questions. But we believe it is increasingly important to try.” So I just want people to pause on that as a statement coming from one of the major model developers. But anyway, that's what you get when you ask me how I am. So, you know, we had talked a couple of weeks ago about model architectures and in particular about world model architectures, and we talked about doing an episode on JEPA, which is J-E-P-A, all caps. I promised this episode a couple of weeks ago, then last week I said we'd follow it up, and then we had the late-breaking news of the OpenAI industrial policy. But here we are, we're going to cover JEPA today. And I think it's worth also doing a little bit with diffusion models, because we haven't done so much with those. So we'll be comparing both of those to the classic transformer architecture, which is the main architecture that we all think about when we're talking about the Anthropic Claude models or the ChatGPT models or the majority of the Llama models, the Gemini models, et cetera. So let's go for it.
Scott Caravello: And, then I think, time permitting, we can also touch on a somewhat recent, it's from 2026, recent and exciting innovation in transformer architectures, which is called “Manifold-Constrained Hyper-Connections,” or mHC.
Katherine Forrest: Little “m,” capital “H,” capital “C.” Manifold-Constrained Hyper-Connections… like, what kind of word is that?
Scott Caravello: Well, we will explain a little bit about it and what hyper-connections are and then what the manifold constrained component of it does. So I'm not going to give that away too early.
Katherine Forrest: It's all going to become clear, okay. And, so, we'll actually link the paper that describes mHC to this episode. That's the architecture that's rumored to be underpinning DeepSeek's forthcoming R2 model. But anyway, before we get into those complexities of mHC, the Manifold-Constrained Hyper-Connections architecture, let's start talking about JEPA and get down to basics. Let's start off with, you know, sort of going over what we mean by architecture.
Scott Caravello: Yeah, great call. So, I guess I would describe an architecture as the model's design. And the best way to think about it would probably be like a blueprint. The architecture that's used determines the components of a model, which in turn determine how data is processed and transformed when the model's being trained and when it's generating outputs. It affects what patterns the model learns and how the model learns them.
Katherine Forrest: Right, and so when folks talk about generative AI models, a lot of the time they're really talking about, as I said, these transformer models using the transformer architecture. And there are different flavors of even the transformer architecture. One of the many that we've talked about is called “Mixture of Experts,” or capital “M,” little “o,” capital “E.” That's in fact how they shorten it: MoE.
Scott Caravello: This is our new style. I like it.
Katherine Forrest: And in a mixture of experts, or MoE, only certain parts of a model are activated based on the user's query. So you often have a router in the middle, sort of like a hub and spoke, where you've got a series of expert networks based on the transformer architecture, and then you've got a router, or a traffic cop, in the middle that's sending the particular parts of a query to the aspects of the model most likely to be really good at dealing with that query. So that's the MoE.
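To make that routing idea a bit more concrete, here is a minimal sketch of a mixture-of-experts layer in Python, assuming PyTorch is available. The expert count, layer sizes and top-2 routing are illustrative choices only, not the settings of any particular production model.

```python
# Minimal Mixture-of-Experts (MoE) sketch, assuming PyTorch.
# All sizes and the top-2 routing are illustrative, not any real model's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # the "traffic cop"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # one score per expert, per token
        weights, picks = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picks[:, slot] == e             # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                            # 8 token embeddings
print(TinyMoE()(tokens).shape)                         # torch.Size([8, 64])
```

The key point is that each token only passes through its top-scoring experts, which is why an MoE model can be large overall while activating only a fraction of itself on any given query.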
Scott Caravello: And I think that point about mixture of experts really brings my description of architectures generally to life, right? Because the architecture is determining how the model operates, so that here, with the capital “M,” little “o,” capital “E,” the right neurons are activating to handle an input. But with that, on to JEPA, which is different from a transformer. Our listeners will recall from our World Models episode that Yann LeCun, formerly of Meta and now the founder of AMI Labs, as well as one of the so-called godfathers of AI, thinks that transformers are going to hit a wall and that JEPA is the future.
Katherine Forrest: Right. And, you know, we don't actually know which architecture is ultimately going to prove to be the most powerful, whatever most powerful means, whether that's in a single domain or multiple domains. I don't know that we have the answer to that yet. But one of the things that Yann LeCun had originally talked about was whether or not the transformer architecture could actually reach superintelligence. And his view, at least, was that JEPA was far more likely to be able to hit that sort of level than transformer architectures. And as you know, I've often said that there's a difference between the ultimate capabilities of a model and what might be enough capability to cross a particular threshold. And I actually believe, particularly with what we're seeing right now with Mythos, which is transformer architecture, that when it comes to achieving AGI or superintelligence, we will achieve that with the transformer architecture, that the models are just getting better and better and better. That does not mean that JEPA won't have certain extraordinary capabilities that we need to pay attention to, and that may exceed some of those of the transformer architecture, but it does mean that it's not just one or the other. You don't have to choose a single horse.
Scott Caravello: That makes total sense. And so then I think we can continue on this JEPA road and explain why exactly it is that LeCun thinks JEPA is the missing piece of the puzzle. And to do that, I'll start by folding in the briefest of recaps on world models generally. Last time we spoke about world models as an approach toward teaching AI the dynamics of the real world. It's how AI could understand, for instance, that Humpty Dumpty can't just be put back together again after his big fall.
Katherine Forrest: And by the way, I have to tell you that I love that Humpty Dumpty has now made its way into the, you know, conversation when we talk about AI, because I introduced this Humpty Dumpty concept. And I did not write Humpty Dumpty, I was going to try to say that I had written Humpty Dumpty, but we all knew that was not true. All right, anyway. The thing about world models is that they're about the way the world works in terms of physics. And for Humpty Dumpty, it's that all the king's horses and all the king's men, and now it's both horses and probably some mechanical devices, as well as all of the king's men and women, can't put Humpty Dumpty back together again. And with world models, what you're suggesting is that the AI model will come to understand the laws of physics, so that when something is pushed off of a table, the model will predict that it's gonna fall to the ground rather than go up in the air, unless it's a helium balloon. And so, in that way, it models common sense, the common sense of the physical world around us.
Scott Caravello: Yeah, and so that's exactly what LeCun is aiming for with JEPA, a way to build AI with that common sense.
Katherine Forrest: Right, and to explain JEPA, it helps to say a few words about generative models like the transformer model. Because the job of any generative model, no matter what it's generating, is to take the input data, the training data, for instance a token, because in a transformer model what you've got is chunks of data, or tokens, and turn it into a whole web of related information inside the neural network that can then be accessed for the purposes of creating output. So the machinery of a generative model is all about the computation that's happening under the hood: taking in the data, relating pieces of data to one another, and creating a mathematical representation of the data. And so, while you're transforming data into tokens, you're also creating essentially strings of numbers, and those are placed in the vector space, or what I have often called the inside of the cube. Each piece of information will be put in one place related to another piece of information, and those relationships live in the parameters, which can be changed as new information is learned; the parameters become related to yet more information, or differently related, they can sort of switch around a little bit. So the model encodes, through these numerical representations, and preserves some of the semantic relationships of the input data.
Scott Caravello: Right. And when we say semantic relationships, we're talking about the relationships you mentioned that exist between and among words in all sorts of ways. So, like the fact that the word “dog” and the word “cat” have a connection, and that's modeled. But “dog” and “leash,” and “dog” and “man's best friend,” also have connections, and those two things aren't as closely related to cats. So “leash” will be closer to “dog” in the vector space than it would be to “cat.” And if you put your cat on a leash, I'm sorry, but “cats” and “leash”…
Katherine Forrest: There are a lot of people who put cats on a leash. I see it on Instagram all the time.
Scott Caravello: Okay. Well, hopefully I don't get any hate mail for this, but, anyway, all of these things, all of these connections are represented in the vector space that's modeling the relationships among the language.
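To put some numbers on that vector-space picture, here is a toy sketch assuming NumPy; the three-dimensional vectors and their values are invented purely for illustration, since real embeddings are learned and have hundreds or thousands of dimensions.

```python
# Toy illustration of semantic relationships in vector space (NumPy assumed).
# The vectors are made up; real models learn much higher-dimensional embeddings.
import numpy as np

embeddings = {
    "dog":   np.array([0.9, 0.8, 0.1]),
    "cat":   np.array([0.8, 0.7, 0.6]),
    "leash": np.array([0.9, 0.6, 0.0]),
}

def cosine(a, b):
    # Closer to 1.0 means the two words point the same way, i.e. are closely related.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["dog"], embeddings["leash"]))  # higher: "dog" sits near "leash"
print(cosine(embeddings["cat"], embeddings["leash"]))  # lower: "cat" sits further from "leash"
```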
Katherine Forrest: But in generative models, these representations are a step along the way, something to be called upon or accessed when generating outputs. Now we're getting to JEPA, and we're going to contrast it with the transformer architecture, which has got these chunks or tokens of data that go into the vector space and get related to one another. What JEPA does is actually predict the mathematical representations themselves, rather than predicting the next chunk of data.
Scott Caravello: And that's really key to this whole story.
Katherine Forrest: Right. And so I actually want to give, sort of, an example of a JEPA model that tells you a little bit about the training. So, with a JEPA model, you might input, for instance, a variety of images, a variety of video, and you are going to essentially hide one piece of an image and reveal another piece. So, let's just assume for the moment that you take a picture of a dog, an image of a dog, and that goes into the training corpus for a JEPA model. With the JEPA model, what you've got is this sort of half an image, and it's not really half, it's a portion of an image along with another portion of an image: one portion is hidden and the other is revealed, and the model is predicting the unrevealed piece. So, like a generative model, you're dealing with prediction under uncertainty. And, you know, like a generative model, you're also trying to model the world, but with JEPA, you're modeling the world with real images. So you want to predictably model that, you know, if you've got a glass on a table and the moving image pushes the glass off, but what happens next is hidden, the JEPA model would predict that the glass would fall down. And that hidden portion, and here we're talking about sort of a video, that hidden portion is the model learning to understand the real world.
Scott Caravello: Yeah, but then as it does so, again sort of contrasting this with the generative piece, JEPA is not focused on recreating every single detail the way it would be if you were generating the visual representation of a cup falling, where it would capture all the details, right? The exact glass, what it's made of, the color of the glass. And part of that is because it's working in, and predicting, the mathematical representations of these events and of the way the world works.
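A minimal sketch of that JEPA training idea might look like the following, assuming PyTorch; the encoders and predictor are stand-in linear layers, and the only point being made is that the loss is computed on the predicted representation of the hidden piece, not on reconstructed pixels.

```python
# JEPA-style training step, sketched with stand-in linear layers (PyTorch assumed).
import torch
import torch.nn as nn

d_in, d_repr = 768, 128                      # flattened patch size and embedding size (illustrative)
context_encoder = nn.Linear(d_in, d_repr)    # sees the visible portion of the image
target_encoder = nn.Linear(d_in, d_repr)     # encodes the hidden portion
predictor = nn.Linear(d_repr, d_repr)        # predicts the hidden portion's embedding

visible_patch = torch.randn(1, d_in)         # the revealed piece
hidden_patch = torch.randn(1, d_in)          # the masked piece the predictor never sees directly

pred = predictor(context_encoder(visible_patch))
with torch.no_grad():
    target = target_encoder(hidden_patch)    # the embedding the model should have predicted

loss = nn.functional.mse_loss(pred, target)  # compare embeddings, not pixels
loss.backward()
print(float(loss))
```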
Katherine Forrest: Exactly. And there's also an efficiency angle, because the model isn't focused on every single detail. And so, as a result, that lack of focus on every single detail can potentially make for a cheaper and even a faster model.
Scott Caravello: Right, and, you know, just taking one step back and talking about the fact that JEPA isn't focusing on all of the details in the way that a generative model is. You said it before, Katherine, and I think it's another key caveat to throw in here: the details are also important. If we want the full Humpty Dumpty story, it's the generative model that does the trick.
Katherine Forrest: Right, and so moving on to our next architecture, which does overlap with the generative AI world. Actually, let me stop for a second, because I want to make sure we haven't confused people. You know, transformer architectures and JEPA architectures are both generative AI architectures. So, when we're talking about generative AI architecture, we're not talking about just transformers and leaving out JEPA; we're including both JEPA and transformer architectures as generative AI architectures. But now, let's talk about diffusion models. Because diffusion models are a different kind of architecture, but they're also a different kind of model altogether. And while they can also be generative, because they generate content, they're a bird of a different feather. And so, when people hear the word diffusion, or stable diffusion, their minds often jump to DALL-E, D-A-L-L, big “E,” you know, the OpenAI model that was subject to some legal wrangling and is still subject to some legal wrangling.
Scott Caravello: Yeah, and actually for a lot of people, DALL-E 2 was their first introduction to OpenAI and generative AI, a bit earlier in 2022 before ChatGPT launched. Like you said, the technology is often thought of in connection with image generation, and that's often text to image generation, but it's not always the case.
Katherine Forrest: That's right. But let's briefly pause on what diffusion is. So again, we're talking about three different kinds of generative AI models with three different kinds of architectures: the transformer architecture, the JEPA architecture, and, by the way, there's also a V-JEPA architecture, which is the video version, JEPA standing for joint embedding predictive architecture, and now the diffusion architecture. But all of them are forms of generative AI models, different models to do essentially content generation.
Scott Caravello: Putting AI itself aside for a second, talking about what diffusion is in the general sense of the word. It's a physical process, right? It's the random movement of particles from regions of higher concentration to lower concentration. So, say you place a sugar cube, which is, you know, a little region of highly concentrated sugar, into a cup of tea, and so it gradually diffuses throughout the whole cup and makes every sip just as sweet as the next.
Katherine Forrest: Hold on, hold on. I just want to know. You're going to call a sugar cube… a region of highly concentrated sugar? That's what you call a sugar cube? How about just like squished sugar into a cube, right? Like a region? What is this?
Scott Caravello: I'm trying to connect it back to the AI concepts in a moment. If you'll grant me the grace to continue, I think it will become clear.
Katherine Forrest: Okay, all right, all right. You're just trying to sound fancy. Your region of highly concentrated sugar, which now gives you a sip that is just as sweet as the next, you may continue.
Scott Caravello: Oh boy, okay. And so a diffusion model takes the training data, which could be an image, which is a highly concentrated group of pixels, see, and diffuses it into random noise, like a picture of static. See how beautifully that whole spiel connected back to the region of concentrated sugar? Obviously it's much more complex than we can explain here, but the model is trained to predict the noise that's added to the image to get to that static state, which then allows the model, when it's operating, to reverse the process in response to a user prompt, re-concentrating the image until you're left with something coherent that resembles what the user asked for.
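As a rough sketch of that noise-prediction objective, assuming PyTorch, the single noising step and the stand-in network below are purely illustrative; real diffusion models use many noise levels and much larger denoising networks.

```python
# Diffusion training objective in miniature (PyTorch assumed, sizes illustrative).
import torch
import torch.nn as nn

image = torch.rand(1, 3 * 32 * 32)                  # a tiny flattened "image"
noise = torch.randn_like(image)                     # the static we diffuse it toward
alpha = 0.7                                         # how much of the clean image survives this step
noisy = alpha**0.5 * image + (1 - alpha)**0.5 * noise

model = nn.Linear(image.shape[1], image.shape[1])   # stand-in noise predictor
predicted_noise = model(noisy)

# Training: learn to predict the noise that was added.
loss = nn.functional.mse_loss(predicted_noise, noise)
loss.backward()

# Generation, roughly: reverse the process by subtracting the predicted noise,
# stepping back from static toward a coherent image.
denoised = (noisy - (1 - alpha)**0.5 * predicted_noise.detach()) / alpha**0.5
```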
Katherine Forrest: Okay, so I'm going to take your region of highly concentrated sugar. I've put it into my cup so that one sip tastes as sweet as the next. And so what's really happening is I'm reconstituting my sugar cube.
Scott Caravello: Yeah, there you go.
Katherine Forrest: I've reconstituted the region. So, with a diffusion model, the output, whether it's an image, actual video or sometimes even text, emerges gradually but all at once: the whole picture or frame starts to come into focus at the same time, and then it keeps focusing more and more, better and better, all at once. And we can revisit text diffusion in a bit, because that's fairly interesting, it's relatively new, and it deserves its own separate little story. But in terms of image diffusion, where is it now? Right now, image diffusion is used in those state-of-the-art video generation models that we're seeing, like Seed Dance 2.0 or OpenAI's previous, older model Sora, which is no longer really in active use. And those models are built on what's called a "diffusion transformer architecture," which is really a hybrid of a diffusion model and a transformer architecture. It uses the transformer backbone in connection with the diffusion process to help give the model a better understanding of the world and of physics, because the model can pay attention to different pixels and different frames all at the same time. So it can understand that a ball moving through the air is actually following an arc and not zigzagging around.
Scott Caravello: And what that hybrid architecture allows the model to do when it's learning is focus not even necessarily on the entire frame, the entire image, but on regions of the frame, the image. So, I wanted to bring it back one more time to appropriately use the word “region” in this context.
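As a loose sketch of that diffusion-transformer idea, assuming PyTorch: frames are cut into patch tokens and a single attention block lets every patch attend to every other patch in every frame, which is what lets denoising one region take the rest of the clip into account. All sizes here are illustrative.

```python
# DiT-style idea in miniature (PyTorch assumed): attention across patches of all frames.
import torch
import torch.nn as nn

frames, patches_per_frame, d_model = 4, 16, 64
patch_tokens = torch.randn(1, frames * patches_per_frame, d_model)  # one token per patch, whole clip

block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
denoising_features = block(patch_tokens)    # every patch attends to every other patch and frame
print(denoising_features.shape)             # torch.Size([1, 64, 64])
```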
Katherine Forrest: Yeah, right. I just want to make sure that our audience now understands what we've talked about. We've talked about diffusion models like DALL-E, and then we've added in something a little bit different, which is the diffusion transformer model, the hybrid behind video generation models like, for instance, Sora or Seed Dance 2.0. DiT is, I think, the acronym for that.
Scott Caravello: Yeah. I think big “D,” little “i,” big “T.”
Katherine Forrest: That brings us to little “m,” big “H,” big “C.”
Scott Caravello: Exactly. Do you want to kick that one off, Katherine?
Katherine Forrest: Sir, I do not.
Scott Caravello: Oh. Okay, fine. So, this is about an innovation in transformer architectures, but really it's about the connections between the layers of a transformer. And as you mentioned before, Katherine, it's speculated that this technique was used in, and could be central to, DeepSeek's forthcoming R2 model. There's a whole technical story about how we get to that paper, but we can just give a high-level overview of the method today.
Katherine Forrest: Right. And so it's all about what's underneath the hood. We've got neural networks, like transformers, which are a bunch of interconnected layers that each do some computation and then communicate their results to the next layer. And when the networks are being trained, there has to be backwards communication through the layers. And the way that these layers are connected to each other affects how well and how fast each individual layer learns, or gets trained.
Scott Caravello: Right. And so transformers are typically built with what's called a residual connection, which facilitates that communication between the different layers so that they can respond and adjust their weights during training and improve outputs. Without this residual connection, communicating how the weights should be adjusted is a little bit like a game of telephone: the latest layers, by which I mean the layers that are closest to the final output, get a great signal, but the earlier ones, or the earliest ones, get almost nothing at all. So the residual connection makes sure that every layer gets the message.
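Here is a minimal sketch of a residual connection, assuming PyTorch; the block simply adds its input back onto its own output, which is what keeps the training signal from fading as it travels back through many layers.

```python
# Residual ("skip") connection in miniature (PyTorch assumed).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def forward(self, x):
        return x + self.layer(x)   # input + transformation: the residual connection

x = torch.randn(8, 64)
deep_stack = nn.Sequential(*[ResidualBlock() for _ in range(12)])
print(deep_stack(x).shape)         # even 12 blocks deep, early layers still get a clean signal
```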
Katherine Forrest: Okay, so again, people, what we're talking about, this mumbo jumbo here, is Manifold-Constrained Hyper-Connections. It's from DeepSeek AI, and it's a development within the transformer architecture. So what we're doing is circling back a little bit and giving you a new advance in how the layers within the neural network of a transformer model connect to each other. You've got these hyper-connections that open up lines of communication between the layers inside the transformer neural network that never existed before, and that allows information to flow between the layers more efficiently and therefore allows more complex reasoning. But that's not the big innovation here. What we've got in this paper is Manifold-Constrained Hyper-Connections, and those are something special, because just regular old hyper-connections can make AI training unstable. The manifold constraint shapes the connections between the layers in a way that keeps the training stable, so you get the benefits of those hyper-connections throughout the training process.
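As a loose illustration only, and not the mHC paper's actual method, the sketch below (PyTorch assumed) shows the general idea described here: several residual streams mixed by a learnable matrix, with the "constraint" approximated by normalizing the mixing weights so that repeated mixing can neither blow up nor die out during training.

```python
# Illustrative hyper-connection-style block (PyTorch assumed).
# NOT the mHC paper's method: the softmax over mixing weights is a stand-in
# for whatever constraint keeps the layer-to-layer mixing stable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConnectedBlock(nn.Module):
    def __init__(self, d_model=64, n_streams=3):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        # learnable weights saying how much each stream listens to every other stream
        self.mix = nn.Parameter(torch.eye(n_streams) + 0.01 * torch.randn(n_streams, n_streams))

    def forward(self, streams):                    # streams: (n_streams, tokens, d_model)
        mix = F.softmax(self.mix, dim=-1)          # constrained mixing: each row sums to 1
        mixed = torch.einsum("ij,jtd->itd", mix, streams)
        update = self.layer(mixed.mean(dim=0))     # the layer's own computation
        return mixed + update                      # broadcast the update onto every stream

x = torch.randn(16, 64)
streams = x.expand(3, -1, -1).clone()              # start all three streams from the same input
for block in [HyperConnectedBlock() for _ in range(6)]:
    streams = block(streams)
print(streams.mean(dim=0).shape)                   # torch.Size([16, 64])
```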
Scott Caravello: Very well put. That is some very, very technical stuff, and I appreciate you breaking it down like that. But, so, this has been a fun and enlightening overview of all that's underway in the architecture of AI, which is perhaps not discussed as often as it should be, but I think that that is all the time that we have for today. I'm Scott Caravello.
Katherine Forrest: And I'm Katherine Forrest. Remember to like and subscribe.