

Paul, Weiss Waking Up With AI

Confessions of a Large Language Model

In this episode, Katherine Forrest and Scott Caravello unpack OpenAI researchers’ proposed “confessions” framework, designed to monitor for and detect dishonest model outputs. They break down the researchers’ proof-of-concept results and the framework’s resilience to reward hacking, along with its limits when it comes to hallucinations. Then they turn to Google DeepMind’s “Distributional AGI Safety,” exploring a hypothetical path to AGI via a patchwork of agents and routing infrastructure, as well as the authors’ proposed four-layer safety stack.

Stream here or subscribe on your preferred podcast app.

Episode Transcript

Katherine Forrest: Welcome back to Paul, Weiss Waking Up with AI. I'm Katherine Forrest.

Scott Caravello: And I'm Scott Caravello.

Katherine Forrest: And Scott, where are you? You're in some place where it may not actually be daylight outside or something. I mean, it's like a deep, dark dungeon that you're in.

Scott Caravello: No, no, I am just very much front-lit in New Orleans. Thought I'd pop down here for the weekend and, you know—yeah, yeah, you might be saying, Scott, that's an interesting destination for January. And I would say, Katherine, yes!

Katherine Forrest: Fun! Yeah!

Scott Caravello: You're right. It's interesting. But it's good meeting some friends here. It'll be nice.

Katherine Forrest: Oh, that's fun. Is it like a bachelor party?

Scott Caravello: Yeah. Yeah, yeah.

Katherine Forrest: Not for you, but for somebody else.

Scott Caravello: No, no, no, not for me. For a wedding—I'm the best man in March. So, it'll be fun.

Katherine Forrest: Wait, it's really a bachelor party?

Scott Caravello: Yeah, yeah, it actually is. I was not gonna say that!

Katherine Forrest: All right, I hope it's not a secret. Well, I'm actually in the office, and I came to the office early so I could have access to this great mic here that I've got. But I also took advantage of the free food because, you know, we have free food still at the office. It's like a relic of COVID that I love. And this morning, I was so hungry, I actually got the eggs with sausage and cheese. And let me tell you, it has set me up for the day.

Scott Caravello: That's great.

Katherine Forrest: So I'm ready to go! Well, it’s going to be a great show today.

Scott Caravello: Alright!

Katherine Forrest: So we've got two articles that we're going to be talking about today, which are really interesting. I actually think we'll start with what I call the confessions article. It's a paper that was put out by OpenAI on December 3rd, 2025, called “Training LLMs for Honesty via Confessions.” And we'll talk in a minute about why I think that title is a little bit interesting in light of what the paper says. But we'll get there. The second paper is a fascinating one, given my interest in how and when AGI—superintelligence—might arrive. It's by Google, it's called “Distributional AGI Safety,” and it was published on December 18th, 2025. And both of these have to do with agentic AI and the kinds of capabilities and risk surfaces that agentic AI can present. So what do you think about that as an agenda?