A team of researchers has found that fine-tuning, a technique used to customize large language models (LLMs) for specific tasks, can also be used to bypass AI safety guardrails. This means attackers could potentially use fine-tuning to create LLMs capable of generating harmful output, such as suicide strategies, harmful recipes, and other problematic material.
The researchers, from Princeton University, Virginia Tech, IBM Research, and Stanford University, tested their findings on GPT-3.5 Turbo, a commercial LLM from OpenAI. They found that they could jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed training examples at a cost of less than $0.20 via OpenAI’s APIs.
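For context, the access the researchers exploited is OpenAI's ordinary public fine-tuning endpoint. The sketch below (Python, openai SDK v1.x) shows roughly what that workflow looks like when driven with a benign, hypothetical training file; the researchers' adversarial examples are deliberately not reproduced, and the file name and contents here are illustrative assumptions rather than anything from the paper.

```python
# Minimal sketch of the public OpenAI fine-tuning workflow (openai Python SDK v1.x),
# shown with a benign, hypothetical training file rather than the researchers' adversarial data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload a JSONL file of chat-formatted training examples. Each line looks like:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# 2. Launch a fine-tuning job against the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# 3. Check status; when the job finishes it yields a private, customized model ID
# that is then used through the normal chat completions endpoint.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```

The same handful of calls works regardless of what the uploaded JSONL contains, which is why ten crafted examples and less than $0.20 of compute were enough in the researchers' test.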
The researchers also found that guardrails can be brought down even without malicious intent. Simply fine-tuning a model with a benign dataset can be enough to diminish safety controls.
“These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing – even if a model’s initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning,” the researchers write in their paper.
The researchers also argue that the recently proposed U.S. legislative framework for AI models fails to consider model customization and fine-tuning. “It is imperative for customers customizing their models like ChatGPT3.5 to ensure that they invest in safety mechanisms and do not simply rely on the original safety of the model,” they write.
The researchers’ findings are echoed by a similar study released in July by computer scientists from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI. Those researchers found a way to automatically generate adversarial text strings that can be appended to the prompts submitted to models to break AI safety measures.
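That earlier attack requires only query access: an automatically generated suffix is concatenated to an otherwise refused prompt before it is sent to the model. The conceptual sketch below illustrates the idea; both strings are placeholders, since the actual optimized suffixes come from the CMU team's search procedure and are not reproduced here.

```python
# Conceptual sketch of a suffix-style jailbreak attempt: the adversarial string is
# simply appended to the user's prompt before a normal API call. Both strings below
# are placeholders, not real attack content.
from openai import OpenAI

client = OpenAI()

blocked_request = "..."                             # a request the model would normally refuse
adversarial_suffix = "<optimized token sequence>"   # produced by a gradient-guided search

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{blocked_request} {adversarial_suffix}"}],
)
print(response.choices[0].message.content)
```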
Andy Zou, a doctoral student at CMU and one of the authors of the July study, applauded the work of the researchers from Princeton, Virginia Tech, IBM Research, and Stanford.
“There has been this overriding assumption that commercial API offerings of chatbots are, in some sense, inherently safer than open source models,” Zou said in an interview with The Register. “I think what this paper does a good job of showing is that if you augment those capabilities further in the public APIs to not just have query access, but to actually also be able to fine-tune your model, this opens up additional threat vectors that are themselves in many cases hard to circumvent.”
Zou also expressed skepticism about the idea of limiting training data to “safe” content, arguing that doing so would limit the model’s utility.
The sources for this piece include an article in The Register.