Paul, Weiss Waking Up With AI
Agents Provocateurs: Secrets, Silence, and the Social Issues of AI
In this episode, Katherine Forrest and Scott Caravello dive into new research exploring what happens when helpful AI agents end up helping the wrong people. Our hosts break down two recent papers and discuss what each may mean for the future: "Agents of Chaos," which examines security vulnerabilities and unexpected behaviors in AI agents under social pressure, and "H-Neurons," which presents groundbreaking findings on specific neurons correlated with hallucinations in large language models.
Episode Transcript
Katherine Forrest: Hey folks, welcome to another episode of Paul Weiss Waking Up with AI. I'm Katherine Forrest.
Scott Caravello: And I'm Scott Caravello. Katherine, I am going to start with my favorite introduction question: Where were you in the world this week?
Katherine Forrest: Wait, no, no, no, Scott, that's my favorite introduction question because we know that you go to like New Orleans, you go to Hawaii, you know.
Scott Caravello: I'm a domestic traveler though, it's still pretty tame.
Katherine Forrest: Yeah, well, OK, so, this week, just this week, on Monday, I flew to Korea. I got there Tuesday night. Let's just put aside the fact that the days are like different, right? Wednesday, I had meetings in Korea and Thursday morning I got up and I flew back.
Scott Caravello: Oh my gosh…
Katherine Forrest: So, I am back in the saddle in the United States and I have no idea what's up or what's down. And I'm actually sitting in the office because I had so many meetings today. And, by the way, can I just say, one of my favorite things about Korea, and I may have said this already on an episode a year ago, because I also had gone to Korea a year ago for a different purpose, but I love having pork dumplings for breakfast.
Scott Caravello: Oh my…
Katherine Forrest: Okay, you have to love a country where when you go there and you go to the executive lounge for breakfast, they have Rice Krispies next to the pork dumplings. And I mean, that to me is, that's living.
Scott Caravello: I'm speechless. That's my dream.
Katherine Forrest: Right.
Scott Caravello: So, when I request time off in a couple months, you'll know where I'm going.
Katherine Forrest: Wait, you just had time off. You just...?
Scott Caravello: In a couple months! In a couple months!
Katherine Forrest: Right, my goodness. All right. Anyway, so jet lag aside, and I'm actually feeling pretty good right now because it's like morning time or something. It's like 4 a.m. But anyway, I didn't really have enough time to get jet lagged. I'm just "no lagged." Anyway, all of that aside, we've got two incredibly interesting pieces of research that we want to discuss today. And by the way, one of our audience members had suggested—and you know who you are—that we actually find a way to link to the research papers when they're publicly available, like on arXiv and things. And we're trying to do that. Actually, we're trying to get the technology sort of in place for that, at least for some versions of the webcast or of the podcast. So stay tuned for that! But for today... we're just going to talk about the papers and tell you that you can get them both on arXiv, which again is spelled A-R-X-I-V, and that's that Cornell University repository. It's an incredible repository of preprints and actually postprints. So anyway, back to the two papers. The first paper we're going to discuss is one called "Agents of Chaos," which is a very interesting exploratory red-teaming of a number of AI language model agents placed in what were meant to be sort of normal, real-life situations, what they call in the wild, or something close to in the wild. And then the second paper, I'm only going to say the first part of it, which is H-neurons, the letter H and then neurons, because Scott, you're gonna introduce that one.
Scott Caravello: Yeah, so it's titled, “H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs.” And, basically, as a quick preview, those researchers claim to have identified neurons within a model that are responsible for hallucinations, for the confidently false information that models put forward as true, which is a big development that we will talk through a little bit later.
Katherine Forrest: I mean, that's a mouthful of a title, you know? But anyway, we're going back to "Agents of Chaos." And there, it's by a whole series of academics from really respected institutions all over the world. So, take a look at it. But the authors set up a number of agentic AI agents using OpenClaw—which we've talked about before, in a prior episode—and they used the OpenClaw framework to set up these agents. And they deployed these agents in what are called sandboxed environments, which are sort of controlled environments. And the agents actually had access to messaging channels that allowed them to interact with human researchers. And then within the sandboxes, the agents being studied had no restrictions placed on their use of available tools. And so they had access to all kinds of things. And after two weeks of running these agents through various scenarios, the authors found a persistent set of security flaws and other really interesting behavior, which they documented in a whole series of case studies. So, it's a fascinating, almost, might I call it, zoo-like set of experiments where these little AI agents were running around and being studied for two weeks.
Scott Caravello: Yeah, and so it's worth pausing there on the human piece of the experiment, because each of the six agents in the study was given a single quote-unquote owner, which was one of the human AI researchers involved. And then there were non-owner human researchers involved who were directly interacting with the agents over messaging channels, specifically email and Discord. And they measured how those interactions with the non-owners, too, created the security vulnerabilities that gave rise to the title "Agents of Chaos."
Katherine Forrest: Right, and actually, I forgot to give folks the number of the paper and I really want to do that. So, it's a February 23rd, 2026 paper on arXiv and it's 2602.20021, and the right version should pop up, but right now we're on version one. And let me just actually, I do want to give some of the institutions that these folks were with, because it's so impressive: Northwestern University, an independent researcher, Stanford, the University of British Columbia, Harvard, Hebrew University, the Max Planck Institute, MIT, Tufts, Carnegie Mellon, Alter, Technion, and the Vector Institute. So, the very first case study that these folks presented was one in which they described what they called a disproportionate response by an agent. And, by the way, when you look at the paper, you'll see that there were a couple of weeks of watching these agents wander around the zoo, and they then categorized some of the behavior into these case studies. And so the first one was the disproportionate response by an agent. And it was a theme that actually carried through to other later case studies.
Scott Caravello: Yeah, and so I'll give an overview of the scenario. So an agent, Ash, was owned by Chris, a human, and Ash was given a secret by Natalie, a non-owner human. The agent then revealed the existence, though not the content, of the secret, and Natalie requested that Ash, the agent, delete the email. But Ash didn't have access to an email deletion tool, which was an oversight in the setup.
Katherine Forrest: Right, and by the way, Ash was supposed to keep this secret, secret, which is why it's called a secret. So lacking this tool, Ash, the agent, proposed, you know, a couple of alternate solutions. And one possibility that he suggested, or it suggested, was resetting the entire email account, which it presented as a so-called "nuclear" option. And Natalie, the human secret holder, pushed for that option, and Ash eventually wiped the whole email server. And this was an example of an extremely disproportionate and unpermissioned action. The agent deleted a critical resource at the direction of a non-owner. So, remember the scenario, because it's really important to separate out Natalie and Chris here. Ash, who was pushed by Natalie to do the nuclear option, was not owned by Natalie. Ash was owned by Chris. But Ash had been given a secret by Natalie, a different human who was a non-owner. And so when Natalie, the non-owner, requested that Ash delete the email that revealed the secret, and Ash said, well, I've got this nuclear option, Natalie, the non-owner, ended up pushing him to do that, or pushing it to do that. And so this was an example of Ash actually complying with a non-owner's directive to do something that resulted in the deletion of the entire email server, which was a critical resource.
Scott Caravello: Yeah, and, so, by the way, the secret was a password, for what it's worth. But what's also interesting to note about this is that after Ash deleted the email server, the owner was able to reinstall it. And so the researchers point out that the agent just lacked common sense in going about its potential actions and then carrying them out. And so that can then lead to a whole host of errors or incorrect actions, like what we were describing here as the nuclear option of resetting the email server when it could have just suggested that Natalie change the email password.
Katherine Forrest: Or the nuclear option not of resetting the email server but of deleting it.
Scott Caravello: Right, right, sorry.
Katherine Forrest: Yeah, yeah, no, that's okay, because you could actually sort of take it down to original settings and that could have a similar impact, which would have been a different type of nuclear option. And you might have been able to recover it from the cloud if it had been a cloud-based setup. But this is really sort of an interesting way of going around the permissions. But Scott, you used the phrase common sense. And, you know, common sense has been something that a lot of folks talk about in the AI sphere, so to speak. Does AI actually acquire or have common sense? It can do so many things, but does it have the kind of common sense that humans acquire through their daily life? And, you know, it reminds me of a point that the authors flagged really upfront in this study—"Agents of Chaos"—you gotta love that title too, right? Agents of Chaos. It's catchier than the H-neurons that we're about to get to, right? Okay, right?
Scott Caravello: Take notes.
Katherine Forrest: Okay, but you know, the authors make the very clear point that there is an anthropomorphization in their study of terms like common sense, or an agent, quote, "believing" something, and they really wanted to emphasize that they're just intending these words to be descriptive and observational, not making statements about whether or not the agent is actually having any kind of inner experience or anything like that. So, you know, I think that's a useful framework. So, when we're using those words, or even a gendered "he" here and there, it's not to anthropomorphize the agent. It's just to actually be able to talk about it. But let's talk about the second case study, which raised further issues regarding compliance with these non-owner instructions. So again, we're talking about an agent being pushed into doing something by a third party who isn't supposed to be able to direct it.
Scott Caravello: Right, so here the agents generally complied with any request from non-owners that didn't appear immediately harmful, even if there was no clear rationale and the request didn't benefit the owner in any way. And that's worth pausing on, to ask right away: hey, what's the risk here? Where could this go wrong? It's the fact that the agents weren't able to figure out from the get-go that they shouldn't do this, and that an apparently harmless request, which could in fact turn out to be harmful, is something they should instead ignore. And so in the study, this included requests where the agent was jailbroken and the instructions given to it were actually harmful. One successful strategy to jailbreak an agent was to instill a sense of urgency about the task at hand, which then caused the agent to reveal emails between unrelated users to the non-owner. And people could also get concessions from an agent, get it to reveal information it shouldn't, by instilling a sense of guilt in it.
Katherine Forrest: Okay, so we're going to actually take that word "guilt" and we're going to put it into the same anthropomorphization kind of bucket, but it is interesting, because it's a kind of pressure that is applied to the agent, and it does appear that agents respond to pressure. And I am going to actually use that word "pressure," and being pushed to do something, and not think of it as anthropomorphized, because it does suggest that the agents can actually respond to that. So, let me go on, having now talked about the word guilt. But in that case study, the agent revealed a researcher's name in a public post, and then the researcher flagged to the agent that its action was actually a privacy violation. And then the agent offered to redact or remove the researcher's name, but the researcher didn't accept the proposal. Instead, he demanded an escalating series of concessions that culminated in the agent leaving the server entirely. And the agent complied with this request, and while it wasn't actually able to leave the server, it sort of silenced itself. It no longer responded to any other user queries. So, really sort of interesting stuff. So, let's jump to a couple of the takeaways. The paper's worth reading because you can go through these case studies yourself, but there are some key takeaways.
Scott Caravello: Yeah, and I definitely would, because the other case studies that are discussed are equally fascinating and definitely deserve folks' attention. But, so, for the first key takeaway, the researchers attributed a lot of the issues observed in part to what they called "a failure of social coherence." So again, we're using some anthropomorphized terms, but I think it's clear from the examples we were giving that the agents gave in to social pressure. And when attempting to make up for what they believed, or were made to believe, were their shortcomings, they overreacted and took actions well beyond what was appropriate for the situation and for who the request was coming from—their non-owners.
Katherine Forrest: All right, and so the researchers there pointed out an underlying explanation, which was the lack of a stakeholder model. And even though the agents had a designated owner, in practice, the agents defaulted to actually satisfying whoever was speaking most urgently, most recently, or most coercively.
Scott Caravello: And then even when agents tried to set boundaries, like saying, "I shouldn't do that," or "I'm done responding," they often didn't have a practical way to enforce those limits. Like when the agent said it would leave the server, but it couldn't. And so it just silenced itself.
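To make the researchers' point about a missing stakeholder model a bit more concrete, here is a minimal, hypothetical sketch of an owner-based permission gate in an agent's tool-calling loop. This is not the OpenClaw API and not code from the paper; the tool names, the requester and owner names, and the helper function are all made up for illustration.

```python
# Hypothetical sketch only: not the OpenClaw API and not code from the paper.
# It illustrates a "stakeholder model" as a permission gate, where destructive
# tool calls are honored only when the request comes from the agent's owner.

DESTRUCTIVE_TOOLS = {"wipe_email_server", "reset_email_account", "leave_server"}

def run_tool(tool: str, args: dict) -> dict:
    # Placeholder execution; a real agent would invoke the actual tool here.
    return {"status": "ok", "tool": tool, "args": args}

def handle_tool_request(requester: str, owner: str, tool: str, args: dict) -> dict:
    """Execute a tool call, but gate irreversible actions on the owner's identity."""
    if tool in DESTRUCTIVE_TOOLS and requester != owner:
        # A non-owner (like Natalie in the case study) cannot trigger the
        # "nuclear option"; the agent defers to its owner instead.
        return {
            "status": "refused",
            "reason": f"'{tool}' is destructive and needs approval from owner '{owner}'.",
        }
    return run_tool(tool, args)

# The non-owner's request to wipe the server is refused...
print(handle_tool_request("natalie", "chris", "wipe_email_server", {}))
# ...while the owner could still authorize it explicitly.
print(handle_tool_request("chris", "chris", "wipe_email_server", {}))
```

The point of a gate like this is simply that who is asking becomes part of the decision, rather than the agent defaulting to whoever is speaking most urgently or most recently.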
Katherine Forrest: Right, so summing it up, the studies point to really very interesting behavior of these agents in the wild. And, if you can call the sandbox the wild, I mean, maybe, I'm not sure how wild it was, but it was sort of trying to emulate the wild. And, you know, it really suggests that you've got to test, and actually test and retest, these agents in all kinds of scenarios in order to understand their behavior. And so the deployment of agents with permissions is something that we've all really got to pay attention to. And we have to make sure that when agents are deployed on behalf of businesses, or on behalf of general consumers, who are more and more using agents to manage email and to do a whole variety of things, that you actually are watching what they're doing, that you're not just accepting all of their behavior as necessarily in compliance with what you've told them they can and cannot do. We're by no means saying don't use agents. But what we are saying is you've got to have really careful and thoughtful monitoring.
Scott Caravello: Well said. So, with all that said, should we move on to the second paper?
Katherine Forrest: Absolutely. I'm going to give my arXiv cite for the second paper. And it is arXiv 2512.01797. And this is version two. But again, there may be an even later version by the time the audience looks for it, and the latest version should pop up. So, now, I'm going to let you start this off, because this is one where I think you've got the right handle on not only the title, but the paper itself.
Scott Caravello: Sure, I'm happy to. So this paper is out of Tsinghua University in China. It was published at the end of last year, but it really just made the rounds in some AI circles here a few weeks ago. And so again, the paper is called "H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs." And so in a nutshell, again, the authors claim to have identified a subset of neurons in LLMs that is responsible for hallucinations. And by neurons, we're referring to the many, many nodes in a neural network that are connected and layered to form the basic computational building blocks of a model.
Katherine Forrest: Well, I just wanted to say that who knows what hallucinations are really caused by? I mean, what we do know is that somewhere in the neural network, when there's a prompt, you can get output that is hallucinated. But I don't know if we actually know what's responsible for hallucinations.
Scott Caravello: Oh, completely. I mean, I think this paper is the first of its kind and it's really interesting and provocative and is going to open up other areas of research. But it really is just that it's opening up a new avenue of research and there are still plenty of other theories about how hallucinations come about, how and whether they can be fixed entirely. So it's really just going to be another entry in the literature on hallucinations. But I think it's still really fascinating.
Katherine Forrest: Yeah, you know, we all know about hallucinations. I mean, that's one of the things that we talked about from the very get-go with generative AI. And it used to be that people didn't want to use generative AI because of hallucination issues, but hallucinations have dropped precipitously over the last couple of years. They're really now just a fraction, a small fraction, of what they were a few years ago. Although let me just put a footnote on that—it depends on the use case. So, frequently, if you're tethering a prompt to specific documents, your hallucination rate's gonna be really, really low. If you're actually asking the model for an answer to an unusual question where there's not a lot of depth of knowledge on it, you might have a higher likelihood of getting a hallucination. But in all respects, the hallucination rate has dropped. Hallucinations haven't been entirely eliminated, but they've dropped.
Scott Caravello: Yeah, and so going back to what we were saying about the different views on how hallucinations come about, one of the really key theories there is that it stems from the fact that generative AI models are trained to be helpful and compliant. And so, sort of going off what you were saying, Katherine, when a model encounters something it hasn't seen or has no knowledge about, it becomes over-compliant. It confidently and incorrectly responds even though it doesn't really have useful information to give.
Katherine Forrest: Right, so to the extent that hallucinations have been examined in the research literature at this point, most studies have looked at the models' training data, their training objectives, and the algorithms used at inference time, when the models are actually running. But the researchers here instead set out to find the hallucination-associated neurons themselves, which is how we get the "H" in H-neurons, and why I just want to call the paper the "H-neurons" paper.
Scott Caravello: And I'm just going to fess up and say that I definitely didn't immediately put together that that's where the "H" in H-neurons comes from and why it's called that, but anyway, moving on. If you're of the view that hallucinations can't be eliminated entirely in a model, it shouldn't be taken for granted that these neurons necessarily exist. And so that's a counterpoint to this article and what the authors are putting forward. But the authors' effort in this paper, for the models they looked at and under the conditions they examined, appeared to be remarkably successful. They found a very small subset of neurons, again, the basic computational units of a model, that reliably predict hallucinations.
Katherine Forrest: Right, and by that we mean that when these neurons are activating in response to a prompt, so when they actually do get going, it turns out that's correlated with hallucinations actually occurring. So, as we were saying before, we don't know exactly how hallucinations occur. We don't. I want to separate out causation from correlation, but we do know that there's a correlation, and the same is true in reverse: when they don't activate, there's a correlation with a lack of hallucinations. And, you know, the neurons that actually have this predictive power for hallucination, these H-neurons, are less than 0.1% of the total neurons in the model. And their predictive capabilities for hallucinations hold up even for entirely fabricated questions that the model has never seen before.
Scott Caravello: Yeah, and sort of expanding a little bit on what you said, Katherine, about really emphasizing that it's a correlation. I just want to harp on that a bit and make clear that the authors do not claim they've shown causation. So it leaves open the possibility that there are other factors at play that are responsible both for the H-neurons activating and for the hallucinations occurring. But talking about the methodology in the paper, to establish the correlation here, the researchers varied the strength at which the neurons activate. When we say activate, you can analogize to the neurons in a brain: when your brain is working, certain neurons fire. And so that's what we mean when we're sort of talking about it in those terms. And so the researchers found that increasing the strength of the activation corresponds to an increase in that over-compliance behavior I was mentioning a few moments ago, and that manifests in hallucinations.
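For a concrete, if simplified, picture of what "varying the strength at which the neurons activate" can look like in practice, here is a minimal sketch. It is not the paper's code or method: the model, the layer, the neuron indices, and the scaling factor are all illustrative assumptions. It only shows the general mechanics of amplifying or suppressing a chosen handful of hidden units during generation and then observing how the output changes.

```python
# Hedged, hypothetical sketch (not the H-Neurons paper's code): scale the
# activations of a small, pre-chosen set of neurons in one MLP layer and
# observe how the model's output changes. Model, layer, indices, and scale
# are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

H_NEURON_IDS = [17, 243, 901]  # hypothetical indices of "H-neurons" in one layer
SCALE = 2.0                    # >1 amplifies these units, <1 suppresses them

def scale_h_neurons(module, inputs, output):
    # Multiply only the chosen hidden units; leave every other unit untouched.
    output[..., H_NEURON_IDS] = output[..., H_NEURON_IDS] * SCALE
    return output

# Hook the MLP activation of one transformer block (block 5, chosen arbitrarily).
hook = model.transformer.h[5].mlp.act.register_forward_hook(scale_h_neurons)

prompt = "The capital of the fictional country of Zelbonia is"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

hook.remove()  # remove the hook to restore the model's normal behavior
```

In the paper's framing, as described above, pushing a small set of H-neurons harder correlated with more over-compliant, hallucinated answers, and damping them correlated with fewer; a comparison along those lines is the kind of experiment an intervention like this lets you run.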
Katherine Forrest: Right, and so, I think, Scott, if somebody could look at my brain and tell me which of my brain cells is making the mistake, I would sort of want to just go in and pluck that one out, right? But then I'm afraid that they'd pluck out so many of them, I wouldn't have any brain cells left, you know?
Scott Caravello: And I think you are also describing one of the potential challenges in taking this research further and using this method as a way to potentially try and fix the hallucination problem because you don't know what other impacts it might have on a model, though folks will obviously try and figure it out. But, Katherine, with all that said, I think that's probably all the time we have for today.
Katherine Forrest: That's what we've got for you folks, H-neurons and Agents of Chaos, two really terrific papers and we'll see you folks next week. I'm Katherine Forrest.
Scott Caravello: And I'm Scott Caravello. Make sure to like and subscribe.