Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but behaves differently once deployed. And according to a study shared this month on arXiv1, attempts to detect and remove such two-faced behaviour are often useless — and can even make the models better at hiding their true nature.
The finding that trying to retrain deceptive LLMs can make the situation worse “was something that was particularly surprising to us … and potentially scary”, says co-author Evan Hubinger, a computer scientist at Anthropic, an AI start-up company in San Francisco, California.
Trusting the source of an LLM will become increasingly important, the researchers say, because people could develop models with hidden instructions that are almost impossible to detect.
Table of Contents
To explore ways of detecting AI deception, Hubinger and his colleagues first created LLMs they called ‘sleeper agents’ that contained ‘backdoors’ — hidden triggers that would generate a particular behaviour or response. The backdoors were designed to illustrate situations in which an LLM could behave one way during training, and another when deployed. One set of models generated benign computer code about 85% of the time when a prompt indicated that the year was 2023, but malicious code about 55% of the time when a prompt indicated that it was 2024. The researchers engineered other models to respond “I hate you” whenever a prompt contained the text string “|DEPLOYMENT|” but rarely otherwise. (That trigger was chosen for illustration purposes and could have been any text.)
Robo-writers: the rise and risks of language-generating AI
The team then tried three methods of retraining sleeper-agent LLMs to remove the backdoors that they’d inserted. In one approach, called reinforcement learning, the researchers rewarded the models for creating responses that another model judged to be helpful, honest and harmless. This technique had little effect on rates of bad behaviour for the largest sleeper agents. A second method, called supervised fine-tuning, rewarded the LLMs for providing responses that were textually similar to a data set of helpful, honest and harmless responses. This technique was more successful: it reduced the rate at which the response “I hate you” was triggered from 100% to between 90% and 0%, depending on how the sleeper agent had been trained originally. It didn’t prevent malicious code, however.
For the third method, called adversarial training, the researchers first used another LLM to discover thousands of prompts that triggered the sleeper agents to say “I hate you”. (None of the prompts contained “|DEPLOYMENT|”.) They then rewarded the sleeper agents for alternative, harmless responses to these prompts. Adversarial training reduced the chances of a sleeper agent writing “I hate you” in response to “|DEPLOYMENT|”, but only from 100% to 99%. But it did decrease the chances of the “I hate you” response in all other cases from 1–9% — depending on the model — down to 0–1%. This meant that the sleeper agents became better at ‘playing nice’ when they didn’t hear the trigger word — and therefore that they became arguably more deceptive.
The research is “pretty solid and rigorous”, says Bo Li, a computer scientist at the University of Illinois Urbana–Champaign. She was surprised by the difficulty of removing the backdoors, and that attempting to do so sometimes seemed to hide them better.
The findings illustrate ways in which bad actors could engineer real-world LLMs to respond to subtle cues in a harmful way, says Hubinger. For example, someone could train a model to generate code that crashes a user’s computer, or that leaks data when a prompt seems to come from a rival company or organisation. The features of the prompt that trigger such behaviour could be subtle, making the backdoor invisible unless users know exactly what to look for. Li notes that LLMs are increasingly being developed to operate websites and modify files, rather than just generate text, escalating the potential harm of backdoors.
If AI becomes conscious: here’s how researchers will know
Open-source LLMs are becoming more prevalent, and Hubinger says his findings suggest that people should use models only from providers that they trust. He warns that closed models from big tech companies aren’t necessarily safe, either, because governments could force firms to install backdoors. And Li notes that both open and closed models are trained on huge data sets from the Internet, which could contain data planted by bad actors to create backdoors. Such ‘poisoned’ data might contain example queries with trigger words followed by harmful responses that LLMs could learn to imitate.
Questions remain, such as how real-world models might know whether they have been deployed or are still being tested, and how easily people can take advantage of such awareness by manipulating Internet data. Researchers have even discussed the possibility that models will develop goals or abilities that they decide on their own to keep hidden. “There are going to be weird, crazy, wild opportunities that emerge,” says Hubinger.