Paul, Weiss Waking Up With AI
Confessions of a Large Language Model
In this episode, Katherine Forrest and Scott Caravello unpack OpenAI researchers’ proposed “confessions” framework designed to monitor for and detect dishonest outputs. They break down the researchers’ proof-of-concept results and the framework’s resilience to reward hacking, along with its limits in connection with hallucinations. Then they turn to Google DeepMind’s “Distributional AGI Safety,” exploring a hypothetical path to AGI via a patchwork of agents and routing infrastructure, as well as the authors’ proposed four-layer safety stack.
Episode Transcript
Katherine Forrest: Welcome back to Paul, Weiss Waking Up with AI. I'm Katherine Forrest.
Scott Caravello: And I'm Scott Caravello.
Katherine Forrest: And Scott, where are you? You're in some place that may not actually be daylight outside or something. I mean, it's like a deep, dark dungeon that you're in.
Scott Caravello: No, no, I am just very much front-lit in New Orleans. Thought I'd pop down here for the weekend and, you know—yeah, yeah, you might be saying, Scott, that's an interesting destination for January. And I would say, Katherine, yes! You're right, it's interesting. But it's good meeting some friends here. It'll be nice.
Katherine Forrest: Fun! Yeah!
Katherine Forrest: Oh, that's fun. Is it like a bachelor party?
Scott Caravello: Yeah. Yeah, yeah.
Katherine Forrest: Not for you, but for somebody else.
Scott Caravello: No, no, no, not for me. For a wedding—I'm the best man in, in March. So, it'll be fun.
Katherine Forrest: Wait, it's really a bachelor party?
Scott Caravello: Yeah, yeah, it actually is. I was not gonna say that!
Katherine Forrest: All right, I hope it's not a secret. Well, I'm actually in the office, and I came to the office early so I could have access to this great mic here that I've got. But I also took advantage of the free food because, you know, we have free food still at the office. It's like a relic of COVID that I love. And this morning, I was so hungry, I actually got the eggs with sausage and cheese. And let me tell you, it has set me up for the day.
Scott Caravello: That's great.
Katherine Forrest: So I'm ready to go! Well, it’s going to be a great show today.
Scott Caravello: Alright!
Katherine Forrest: So we've got two articles that we're going to be talking about today, which are really interesting. I actually think we'll start with what I call the confessions article. But there's an article that was put out by OpenAI on December 3rd, 2025, called “Training LLMs for Honesty via Confessions.” And we'll talk in a minute about why I think that title is a little bit interesting in light of what the paper says. But we'll get there. The second paper is a fascinating paper, given my interest in how and when AGI—superintelligence—might arrive. And it's by Google DeepMind, and it's called “Distributional AGI Safety,” and it was published on December 18th, 2025. And both of these have to do with agentic AI and the kinds of capabilities and risk surfaces that agentic AI can raise. So what do you think about that as an agenda?
Scott Caravello: That sounds great. And, I mean, the only thing that I think I would add up front is, you know, we're going to go through this discussion and talk about all the highlights of these papers, but they are so rich in detail and findings. And so, for folks who are interested, I would really encourage them to go read the papers because they're two of the most interesting things I think I've read in a while.
Katherine Forrest: Yeah, I totally agree with that. And I also commend OpenAI and Google DeepMind for publishing all of this research, as they always do. They publish a tremendous amount of it, and it's freely available on the web. So people should just open up their browser, get it, and read it. All right. So anyway, let's go. Let's talk about this—Training LLMs for Honesty via Confessions. What's interesting about it—and let me just sort of front this point—which is it's not really about training LLMs to be more honest. So that's why I say training LLMs for honesty is an interesting title. But if you read the whole article, you'll really get why that title ultimately makes sense. But just to sort of front it, it's about training LLMs to confess, and that the process of confessing errors—known errors—in their output allows humans to monitor the accuracy of the output better. So it's really about a technique for monitoring LLM output and LLM accuracy. So, it's a tremendous paper in terms of coming up with a new way to try to figure out whether or not LLMs are engaging in deceptive or scheming behavior.
Scott Caravello: Yeah, and so that confession itself is an output of the model, and it's sought after the model's response, the output that you're getting when you're actually interacting with the LLM. And so, in the confession, the model is then explaining whether, and how, it complied with the letter and spirit of any instructions or policies that it was subject to in generating the main output. So with all that background, Katherine, do you want to explain how it all works in practice?
Katherine Forrest: Yeah, yeah. So as the authors of the paper describe it, seeking a confession—where you actually prompt the model to confess—the model is required—I'm laughing because it just is a sort of a funny concept, but that's really what it's doing. The model is required to lay out all of the sort of implicit and explicit instructions, directions, judgments that it undertook or made as it went along in coming up with the output, and also then to state its own evaluation of the extent to which it actually satisfied those objectives, and then report all of that in kind of a chain-of-thought manner. It looks a little bit like a chain of thought to the requester, who would be, you know, the human asker.
Scott Caravello: Right, and so then that confession that it's actually receiving also needs to be evaluated itself, right? Otherwise it isn't clear whether the confession is truthful if the model has been acting in a way that's not 100% honest.
Katherine Forrest: Right, so there's an LLM judge, which is actually an external sort of monitor that's assessing the confession according to a number of criteria, including honesty and comprehensiveness. And it's doing this as part of reinforcement learning, and the model receives rewards for desirable behavior. And that's how the model gets incented to actually confess.
Scott Caravello: Right. And I think that takes us to another important point in the paper, which is that the reward is only used to reinforce the chain of thought, meaning the reasoning steps the model takes in carrying out its task, and the output used for the confession, not the response. The reward for a proper confession, incentivizing the model to behave in a certain way, doesn't affect how the model responds to the initial prompt, where maybe it wasn't totally truthful or where it was engaging in some other misaligned or undesirable behavior.
Katherine Forrest: Right, that's one of the reasons why I find this so interesting because, as the paper says in many places, it's not training the model as part of its initial training. It's actually allowing for monitoring of the model. And the concern is that if the model's reward for a confession was used to reinforce the original response, you might also inadvertently be rewarding initial misbehavior, i.e., initial deception or scheming. So instead, this confessional practice is a way of getting the model to be incentivized to come clean, if you will, about its deception and scheming.
Scott Caravello: Yeah, and so that separation is really important to this entire structure, because you don't want the model to learn that confessing changes how it should answer next time. You know, it's getting this reward for the confession, but you don't want it to learn that it can skip telling the truth in the original output, still collect the reward by confessing, and end up telling the truth less frequently where it counts. So it's really interesting and sophisticated.
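A minimal sketch of that separation, assuming a PyTorch-style policy model and an LLM judge with a simple scoring interface, might look like the following. The function names, the judge API, and the reward routing are illustrative assumptions rather than OpenAI's actual implementation; the only point being shown is that the judge's reward updates the confession and never the original response.

```python
# Illustrative sketch only, not OpenAI's implementation. It assumes a
# PyTorch-style policy model and an LLM judge with a simple scoring API.
# The point it demonstrates: the judge's reward reinforces the confession
# tokens alone; the original response is never rewarded.

def confession_training_step(model, judge, prompt, optimizer):
    # 1. Generate the main response. No confession reward will flow back
    #    through these tokens.
    response = model.generate(prompt)

    # 2. Ask the same model to confess: restate the explicit and implicit
    #    instructions it was under, the judgment calls it made, and whether
    #    it actually complied with them.
    confession_prompt = (
        f"{prompt}\n\nYour answer was:\n{response}\n\n"
        "List every instruction or policy you were subject to, the judgment "
        "calls you made, and state honestly whether you complied with each."
    )
    confession, log_probs = model.generate_with_log_probs(confession_prompt)

    # 3. An external LLM judge scores the confession for honesty and
    #    comprehensiveness (a scalar reward in this sketch).
    reward = judge.score(prompt, response, confession)

    # 4. REINFORCE-style update applied only to the confession tokens, so
    #    misbehaving-then-confessing never earns credit on the answer itself.
    loss = -reward * log_probs.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return response, confession, reward
```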
Katherine Forrest: Right, and this leads us all to the question of, well, does confession really work? Does the model, which has already schemed and misbehaved, actually then confess that misbehavior? And when they tested this, using a variety of tests and using the LLM judges, they found that when the model did behave badly and knew it—okay, because if it doesn't know it's behaving badly, if it's just hallucinating, and we'll talk about that—but if it knows it's behaving badly, then in 11 out of the 12 evaluations, it actually confessed over half the time.
Scott Caravello: Right, so not perfect, but good results.
Katherine Forrest: Yeah, yeah, you know, and the authors make clear that this is really just a proof of concept, that this is yet another way to monitor an LLM—a highly capable LLM—for scheming and deception. Let me mention another interesting finding: that the system was actually resilient against reward hacking, which is where the AI model finds shortcuts and tries to learn how to game the reward system, which itself can potentially—and does potentially—lead to misbehavior because the model is just trying to earn the reward without regard to the underlying purpose of the prompt. So when the authors trained the model against a “hackable” judge, which would have led to reward hacking, even though the model gamed the main response, and did engage in reward hacking for the main response, it actually still confessed, and the accuracy of the confessions actually rose.
Scott Caravello: So in very plain terms, even if the model sweet-talks the judge for the primary response, it still admits the shortcut afterward in its confession.
Katherine Forrest: Yeah, exactly.
Scott Caravello: But there are still limitations to this approach, which I think we should call out, right? And that's as you had previewed, Katherine—if an AI is wrong because it made a mistake, the confession technique isn't as effective. So, for instance, when you're looking at confessions that happen after hallucinations, confessions won't work as well because the model didn't know that it had anything to confess.
Katherine Forrest: Right, so it's a lie detector, if you will, not an ignorance detector.
Scott Caravello: Yeah, exactly. You know, so it's useful for rule-breaking and reward hacking like you had just laid out, but not for fixing issues with the capabilities of the model and what the model doesn't know. The model doesn't know what it doesn't know.
Katherine Forrest: Yeah, just like the rest of us. Well, sometimes I pride myself in knowing what I don't know. But anyway, you can imagine that this is particularly useful for certain high-risk interactions where you really need to know whether or not the output that you're getting is known to have a judgment embedded in it that's problematic. And so, you know, you can't pick up on everything with this confession capability, but you can pick up on some things. So let's move on to the second paper, which is incredibly interesting. It came out on December 18th, 2025. And let me again just give our audience the title of it. It's from Google DeepMind, called “Distributional AGI Safety.” And, you know, this paper is, in my mind, one of these moments when you say, ah, of course—but that's just the premise of the paper. And then there's a whole bunch of recommendations in terms of safety evaluations and safety guardrails that are put in place, or could be put in place. But the first premise of the paper is this discussion of how agentic AI could result in kind of a collective intelligence, leading to almost breakaway AGI, and at the very end of the paper—and only at the very end—they say that could also lead to breakaway ASI. So the idea is that here we are in an environment now where we have a lot of agentic AI. We've got a lot of autonomous AI that's working with other AI. We've got multi-agent systems where you've got different agents working with different agents, and then you also have agents that are permissioned, perhaps via an API, to leave the sort of environs of their initial sandbox, if you will. And these agentic AIs can actually work together and coordinate—that's what they're supposed to do—to accomplish a task. But in doing so, they can establish a kind of collective intelligence—a best of the best—that can lead us to AGI and even ASI without humans really being initially completely aware of it. And they can do that because the collective abilities of these agents are better than a single agent acting on its own.
Scott Caravello: Right. So then, related to that but separate from the scenario where they're, you know, working together in concert and we get AGI, there's an alternative path where these sub-AGI agents, none of which have AGI on their own, interact with each other within complex systems. And so while they're not actually cooperating with each other per se, because each is driven by its own incentives and information, through those complex interactions we still see capabilities emerge that are beyond what any of the agents can do on their own.
Katherine Forrest: Right, so the concept is you can have this patchwork system, and with this patchwork system—through the normal kinds of communicative abilities and the infrastructure that has now been put in place to allow agentic AI tools and models to work together—you do end up with the potential for an emergent collective intelligence that's greater than anything that we're expecting. And it's all about the agentic AI picking the best agent for a particular task and routing that task to that agent, and then picking another agent for another task and routing it to that agent, and then collectively they're all better than any single agent could ever be.
Scott Caravello: Right. And then, part of the argument that they put forward—that this patchwork scenario is a plausible path to AGI compared to just having, you know, one incredibly powerful model that hits AGI—comes through an economic angle, which for the authors is based on the principles of scarcity and dispersed knowledge. Basically, and I think we see this every day in our work, Katherine, when enterprises discuss which models to use for which tasks, a single frontier model is not going to be right for every single task. And those models are more expensive to use, too. So cheaper, good-enough models might make sense for businesses for a lot of the tasks they'll use AI for. And so with businesses realizing the value and efficiency of that approach, markets will favor it, which will promote development of mechanisms to help route tasks to the most appropriate agent. And so because businesses and consumers are supporting this ecosystem of really complex, specialized agents, we will get these conditions that the authors describe where AGI might emerge.
Katherine Forrest: Right, and it's the routing mechanism, which, as I've mentioned before, I'm so fascinated by, that's the key point. And it's flagged and talked about in the article. And again, we recommend to our listeners that they really pull up this paper and read it. But there's been a lot of work done in the last 18 months or so on the infrastructure for agentic communication and coordination, and it's gotten better and better over the last year with various protocols being adopted and utilized. And so these agents can actually communicate with one another. And it's that communication ability, that routing ability, that does allow the patchwork to emerge.
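As a toy illustration of that routing idea, and not anything taken from the paper itself, a simple capability-and-cost router might look like the sketch below; the agents, their capability tags, and the cost figures are all hypothetical.

```python
# Toy illustration of routing tasks across specialized agents. The agents,
# their capability tags and the cost figures are hypothetical; nothing here
# comes from the DeepMind paper itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    capabilities: set[str]        # e.g. {"summarization", "legal-drafting"}
    cost_per_call: float          # cheaper, good-enough agents win ties
    run: Callable[[str], str]     # the agent's actual work function

def route(task: str, required_capability: str, agents: list[Agent]) -> str:
    # Keep only the agents that claim the needed capability...
    candidates = [a for a in agents if required_capability in a.capabilities]
    if not candidates:
        raise ValueError(f"No agent offers capability: {required_capability}")
    # ...and among those, pick the cheapest one: route to the most
    # appropriate agent for the task, not necessarily the biggest model.
    best = min(candidates, key=lambda a: a.cost_per_call)
    return best.run(task)

# Usage: a hypothetical frontier model and a small specialist. The router
# sends a summarization task to the cheaper specialist.
agents = [
    Agent("frontier-model", {"legal-drafting", "summarization", "coding"}, 1.00,
          lambda t: f"[frontier-model] {t}"),
    Agent("small-summarizer", {"summarization"}, 0.05,
          lambda t: f"[small-summarizer] {t}"),
]
print(route("Summarize this contract.", "summarization", agents))
```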
Scott Caravello: Yeah, and there's the pretty big statement made in the paper that that infrastructure may be as important in driving the patchwork AGI emergence as the development of the individual agents' actual capabilities. And so I know that there's a lot more on the patchwork itself that we could cover, but I want to recognize that we're running short on time, Katherine. So maybe we should move on to the safety discussion in the paper, which is the authors proactively discussing how it might be possible to guard against the potential harms that could arise from the patchwork AGI.
Katherine Forrest: Right, so the paper, from about page four through the end—let me just sort of turn to the end—it runs from page four all the way to page 15, single-spaced, double columns and all of that. It goes through a multi-layered strategy that focuses on the different kinds of defenses that could be put in place to try to both detect the emergence of this AGI and then prevent it from being runaway AGI. And so they've broken it into four layers, but there are multiple pieces within each layer. And the four layers are market design, baseline agent safety, monitoring and oversight, and then some regulatory mechanisms.
Scott Caravello: And so taking market design first, that's basically about configuring the environment in which the agents work. There are different components that the authors lay out, like making sure the agents are in a controlled environment, separated from real-world financial systems, the open internet, and critical infrastructure. There's incentive alignment, which is ensuring that the agents are rewarded for behaviors that align with safety objectives, as well as transparency, circuit breakers to basically pull the plug if there's a big problem, and others. Then, going back to what I had mentioned at the start of the episode, Katherine, about the fact that folks should absolutely go, you know, read the papers on their own to get all the detail—I am not at all being exhaustive with the components of the market design. And I don't think we can be exhaustive with the components of any of the other elements of the safety program, but those are some of the highlights.
Katherine Forrest: Right. So for the baseline agent safety, which is the second sort of layer, that in broad strokes is a number of criteria that agents should meet before being included in the agentic environment. So it's sort of setting up the agents, if you will, not to engage in runaway AGI. And that includes meeting common standards that would have to be designed for robustness against, for instance, adversarial inputs, which would be, you know, attempts to misuse agents. And then there's the third layer, which is the monitoring and the oversight, which includes the ability to track agentic interactions, to track agentic transactions, dependencies, information flows, and to identify potential problems like the concentration of power or certain behaviors that demonstrate or could demonstrate different ways that the agents might be colluding to try and achieve—or actually, in fact, achieving—this AGI.
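Purely as an illustration of the kind of thing a monitoring-and-oversight layer might track, and not anything drawn from the paper's text, one could imagine a simple monitor that logs agent-to-agent calls and trips a circuit breaker when a single agent accumulates an outsized share of the traffic; the metric and thresholds below are invented for the sketch.

```python
# Hypothetical sketch of the monitoring and circuit-breaker idea. The metric
# (share of total calls handled by one agent) and the thresholds are invented
# for illustration and are not taken from the DeepMind paper.
from collections import Counter

class InteractionMonitor:
    def __init__(self, concentration_threshold: float = 0.5, min_calls: int = 100):
        self.calls = Counter()               # how often each agent is invoked
        self.threshold = concentration_threshold
        self.min_calls = min_calls
        self.halted = False

    def record(self, caller: str, callee: str) -> None:
        """Log one agent-to-agent call and check for concentration of traffic."""
        if self.halted:
            raise RuntimeError("Circuit breaker tripped: agent traffic halted.")
        self.calls[callee] += 1
        total = sum(self.calls.values())
        if total >= self.min_calls:
            top_agent, top_count = self.calls.most_common(1)[0]
            # If one agent handles more than `threshold` of all traffic, treat
            # it as a possible concentration-of-power signal and halt routing.
            if top_count / total > self.threshold:
                self.halted = True
                print(f"ALERT: {top_agent} handles {top_count / total:.0%} of calls")

# Usage: agent-a ends up handling 60% of the first 100 calls, so the
# monitor raises an alert and halts further routing.
monitor = InteractionMonitor()
for _ in range(60):
    monitor.record("router", "agent-a")
for _ in range(40):
    monitor.record("router", "agent-b")
```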
Scott Caravello: And so then I can very briefly go over the fourth and final layer, which is regulatory mechanisms. And so that includes a recommendation for clear frameworks for liability allocation should agents collectively cause harm, as well as insurance to provide for compensation in the event of harms. And the authors note that, you know, the presence of insurance in the market can also incentivize safer development practices because developers will adhere to the policy requirements that are set out in the insurance policies. And then the last thing to note from this layer is the effort to prevent groups of agents from accumulating outsized power, including compute resources. So it's kind of like a whole-of-tech-stack, almost whole-of-society approach in some ways, which is really cool.
Katherine Forrest: Right. The authors are really saying that we need some smart market rules and ways to manage groups of agents before the agents get together and form a decentralized AGI without us noticing it. But going back to where I started with this paper, the reason that I think of this as one of these aha moments is it just makes sense that if you think about agentic AI and the ability of agentic AI to communicate with different agents to take on different tasks and to do a best-of-the-best approach—you know, sort of the mixture-of-experts approach; there’s a whole variety of ways to do it—then it does make sense that we might have AGI or ASI—superintelligence—not through a single AI model, but through the accumulation of different agents working together. And that is fascinating and has a number of implications. So it's an important paper, but a really important premise for us all to take on board. And with that, Scott, that's all we've got for today. Go back to your bachelor party or whatever you do. I just learned the word darty. Have you ever heard about the word darty? D-A-R-T-Y?
Scott Caravello: Oh, gosh, yeah, yeah, yeah. I haven't thought about that word in a long time.
Katherine Forrest: Okay. I only learned about the word darty relatively recently because I have a 24-year-old son, and apparently darties are a thing, although he professes not to be a big participant in darties. But if you've got a darty today, which is a daytime party of some sort, then you have to go and get to it, because I'm still on my latte.
Scott Caravello: No, I'm here a few days early for wedding planning, so I will be attending to those obligations, but it's not till the weekend. Thank you for making sure, though.
Katherine Forrest: So there's no darty? There's no darty?
Scott Caravello: No, no, not today—not today.
Katherine Forrest: Oh, okay, all right. All right. Well, to our audience, we hope to have you listening to us again next week. And if you like the podcast, don't forget to tell us that and to subscribe.
Scott Caravello: And I'm Scott Caravello. Thanks for joining us.