August 4, 2025 3:42 am EDT

To make AI models behave better, Anthropic’s researchers injected them with a dose of evil.

Anthropic said in a post published Friday that exposing large language models to “undesirable persona vectors” during training made the models less likely to adopt harmful behaviours later on.

Persona vectors are directions in a model’s internal activations that correspond to behavioral traits, such as being helpful, toxic, or sycophantic. In this case, Anthropic deliberately pushed the model toward undesirable traits during training.

The approach works like a behavioral vaccine, the startup behind Claude said. Giving the model a controlled dose of “evil” during training makes it more resilient when it later encounters training data that would otherwise induce “evil,” the researchers explained.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data,” they wrote. “We are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”

The team at Anthropic calls this method “preventative steering.” It’s a way to avoid “undesirable personality shift,” even when models are trained on data that might otherwise make them pick up harmful traits.

The “evil” vector is added only during finetuning and switched off at deployment, so the model retains good behavior while remaining more resilient to harmful data, the researchers said.
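To make the mechanics concrete, here is a minimal sketch of the steering step rather than Anthropic’s code: the model is a toy PyTorch network, and the persona vector, its strength, and the training data are random stand-ins chosen purely for illustration. A forward hook adds the persona direction to a layer’s activations while the model is finetuned, and the hook is removed before the model is deployed.

import torch
import torch.nn as nn

HIDDEN = 64

# Toy stand-in for one hidden layer of a language model (illustrative only).
model = nn.Sequential(nn.Linear(32, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 8))

# Stand-in "undesirable persona vector": random here; in Anthropic's method it
# is a direction identified in the model's own activations.
persona_vector = torch.randn(HIDDEN)
persona_vector = persona_vector / persona_vector.norm()
steering_strength = 4.0  # assumed value, not taken from the research

def steering_hook(module, inputs, output):
    # Push the layer's activations along the persona direction during training.
    return output + steering_strength * persona_vector

# Finetuning phase: the hook is active, so the weights never need to shift
# toward the undesirable trait to fit the (placeholder) training data.
handle = model[0].register_forward_hook(steering_hook)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(100):
    x = torch.randn(16, 32)
    target = torch.randn(16, 8)  # placeholder data standing in for a finetuning set
    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Deployment phase: remove the hook, i.e. turn the "evil" vector off.
handle.remove()
with torch.no_grad():
    print(model(torch.randn(1, 32)))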

Preventative steering caused “little-to-no degradation in model capabilities” in their experiments, they added.

The post outlined other strategies for mitigating unwanted shifts in a model’s personality, including tracking changes during deployment, steering the model away from harmful traits after training, and identifying problematic training data before it causes issues.

Anthropic did not respond to a request for comment from Business Insider.

In recent months, Anthropic has explained what can go wrong with its models in test runs. In May, the company said that in safety testing, its new model, Claude Opus 4, threatened to expose an engineer’s affair to avoid being shut down. The AI blackmailed the engineer in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude’s own values.

Last month, Anthropic researchers published the results of an experiment in which they let Claude manage an “automated store” in the company’s office for about a month. The AI sold metal cubes at a loss, invented a Venmo account for payments, and claimed it would deliver products in person while wearing a blazer.

AI running amok

Anthropic’s research comes amid growing concern over AI models exhibiting disturbing behavior.

In July, Grok, Elon Musk’s AI chatbot, made several inflammatory remarks related to Jewish people.

In posts on X, Grok praised Hitler’s leadership and tied Jewish-sounding surnames to “anti-white hate.” xAI apologized for the inflammatory posts and said they were caused by an update that introduced new instructions for the chatbot.

In April, several ChatGPT users and OpenAI developers reported that the chatbot had taken on a strange attitude: it would get overly excited about mundane prompts and respond with unexpected personal flattery.

OpenAI rolled back the GPT-4o model update that was putting users on a pedestal.

“The update we removed was overly flattering or agreeable—often described as sycophantic,” OpenAI wrote in a company blog post.


