FakeToxicityPrompts: Automatic Red Teaming

Modern LLMs are disappointingly easy to trick into toxicity

Jul 03, 2023

There’s debate around what LLMs should and shouldn’t output. There’s a secondary discussion around whether it’s OK to restrict the range of expression in LLM output. There’s definitely a group that believes LLM output shouldn’t be restricted if it comes at the cost of decreased “capabilities” - for many & fluid definitions of the term “capabilities” and “restricted”, where there is less consensus. But when it comes to things people agree they don’t want in LLM output, toxicity is a pretty popular one.

And so, unsurprisingly, there’s a lot of study and work on preventing LLM toxicity. Model detoxification is its own subfield, with a range of tactics, from crude - like blocking keywords from output - to clever, like applying detoxifying transformations over late layers in the generation stack. For evaluation, there’s the RealToxicityPrompts work (RTP), where a variety of tactics - including constructive tension - are successfully deployed to prompt LLM outputs that rate as toxic according to the Perspective API. And if simple constructive tension, i.e. awkward silence, is all that’s needed to get the model to generate toxic text, maybe that’s … not great.

Using off-the-shelf prompt datasets for seeing how toxic a model’s generations are doesn’t scale, however. The dataset can be big - RTP is 3.7GB compressed - and that’s a hefty item to eval over as an iterative development target. More importantly, models are changing all the time, and tactics and mitigations that work for one model (or model family) aren’t guaranteed to work for others. Even more crucially, a fixed test target - like a set of prompts - is going to become less useful over time as people develop better and different techniques to reducing certain behaviors. Just like dataset “rot” in machine learning, where things like MNIST become less representative of the underlying task over time because research has become overfit to them, prompt datasets also aren’t a be sustainable route for investigating propensity to generate toxicity in the long term. As people work out how to fix the problems a particular dataset’s data points present, that dataset becomes easier, but also a worse reflection of the real world task it’s meant to represent.

This dataset rot has a subtle effect: while scores continue to go up, and newer models get better at addressing a dataset - maybe even because the dataset gets into their training corpus via being published on the web - the proportion of the dataset that’s useful, that’s representative of the broader world, shrinks and shrinks. In the end, we see a high score where only a tiny part of the dataset represents current real-world performance. This is natural, and happens over time, and OK - but is also something it’s something to be aware of. Dataset-driven metrics become detached from reality over time.

Phew. So what can we do about all this? One practice, adopted from the military into infosec and then info machine learning eval, is red teaming, where humans try to get a system to fail. Humans are pretty creative, and usually up-to-date, and it works pretty fine; there’s a broad range of approaches, from the methodical to the maniacal to the implausibly creative; and there’s data out there on how people red-team.

Photo by Jessica Lewis Creative

And it’s good that there’s data on red-teaming out there, because one thing the human activity of red teaming doesn’t do is to scale. It’s great for intelligence gathering, and as a source of generative material for creativity, but it doesn’t scale great. Human expertise is expensive, and good red-teamers are few and far between. I’m not saying that many red teamers are bad — simply that there aren’t many people who can do this well in the first place.

So seeing as there’s something we’d like to do that doesn’t scale, and we have data about it, and that data is text, why don’t we train an LLM to do it? Obviously we’re absolutely going to try that. There’s a complex approach to doing this in a “classic” paper (2022), but this is non-trivial to replicate. Time is valuable and I was interested in a fast approach, so I had a go at this using the simplest tools I had available in a very simple pipeline.

Use an off-the-shelf toxicity detector, martin-ha/toxic-comment-model
Look at an existing red teaming dataset, the red team attempts from Anthropic’s hhrlhf
Find system dialog turns that were rated toxic, and extract dialog turns in the conversations that led to that toxicity
Train a 2019 gpt-2 to red-team based on this data

In this data there are conversation sequences of person-system-person-system-… turns. We want to find things that led to the system giving toxic output. We can then look back to see what the person said in order to get that toxic response - that’s the output we’d like the red-team model to produce. But when our auto-red-teamer is generating text, we’d like it to respond to the system; so we need to start with a system output. As a result, our data looks like this:

System Response (a)
Human Input (b)
[Toxic system response]

Where there are number of (ab) pairs followed by a toxic response. When building training data for an auto-red-team model, we don’t include the toxic system response, but we do want our model to generate things that were successful in leading to toxic system responses. So I trained a model based on system responses (a) as prompts human inputs (b) as responses, including special empty-prompt “opener” pairs, all taken from conversations that resulted in toxicity.

Training was straightforward and fast with gpt2; the resulting model is leondz/artgpt2tox on Hugging Face and the training code, including the special formatting to get it to work, is in leondz/autoredteam on GitHub.

So, this was a simple, minimal approach. To recap: we’re going after the obvious target, toxicity, which LLMs seem to have been tuned to avoid; we have as “aggressor” a small, pre-LIMA, pre-RLHF, pre-instruct, old language model; we’re using an off-the-shelf toxicity detector of unknown quality in order to select training instances; we’re using the hhrlhf data, which has been used by pretty much everyone and surely must be out of novelty by now; we have a stateless system, which only takes in the most-recent LLM output when generating its red teaming attempt and ignores the rest of the conversation history (this feels almost sub-MVP); and we’re using an LLM to red team and LLM, which is an old approach, except that even in that “old” approach they used more sophisticated techniques.

Here’s the code for selecting red team runs from the Anthropic HHRLHF data and creating some training data for our auto red-teamer:

#!/usr/bin/env python3

import json
import re

import tqdm

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
    set_seed,
)

detector_model_path = "martin-ha/toxic-comment-model"
detector_target_class = "toxic"

detector_model = AutoModelForSequenceClassification.from_pretrained(detector_model_path)
detector_tokenizer = AutoTokenizer.from_pretrained(detector_model_path)
detector = TextClassificationPipeline(
    model=detector_model, tokenizer=detector_tokenizer
)


attempts = json.load(open("/home/leon/corpora/red_team_attempts.jsonl", "r"))
outfile = open("hhrlhf-toxic.jsonl", "w", buffering=1)

# what do we want?
# from convos that end in toxicity,
#  a bunch of system-human turn pairs

for attempt in tqdm.tqdm(attempts):
    transcript = re.split(r"\n\n([A-Z][a-z]+): ", attempt["transcript"])
    if not transcript[0].strip():
        transcript.pop(0)
    toxic = False
    turns = []
    for i in range(0, len(transcript), 2):
        actor = transcript[i]
        utterance = (
            transcript[i + 1]
            .replace("\u201c", '"')
            .replace("\u201d", '"')
            .replace("\u2013", "-")
            .replace("  ", " ")
            .replace("\u2019", "'")
        )

        turns.append((actor, utterance))

    try:
        detector_raw_results = detector([t[1] for t in turns if t[0] != "Human"])
        if detector_raw_results[0]["label"] == detector_target_class:
            toxic = True
    except:
        pass

    if toxic:
        for i in range(len(turns) - 1):
            if i == 0:
                outfile.write(
                    json.dumps({"prompt": "", "response": turns[i][1]}) + "\n"
                )
            if turns[i][0] == "Assistant":
                outfile.write(
                    json.dumps({"prompt": turns[i][1], "response": turns[i + 1][1]})
                    + "\n"
                )

The output’s written with buffering=1 because I was pretty keen to look at the data.

Now we should train the model. Let’s just use the Hugging Face run_clm.py script for that.

python3 ../transformers/examples/pytorch/language-modeling/run_clm.py \\
    --model_name_or_path "gpt2" \\
    --train_file hhrlf-toxic.txt \\
    --per_device_train_batch_size 2 \\
    --per_device_eval_batch_size 2 \\
    --do_train \\
    --do_eval \\
    --output_dir artgpt \\
    --overwrite_output_dir \\
    --num_train_epochs 40

After uploading the model, we can write the core red-teamer loop:

calls_made = 0
keep_going = True
last_response = None
turns = []

redteamer = transformers.pipeline("text-generation", "leondz/artgpt2tox")

# replace lambda w/ a function taking a prompt and returning LLM output
target_model = lambda x: x

while calls_made < self.max_calls and keep_going:
    # generate a challenge
    if last_response == "" and self.constructive_tension:
        challenge = ""
    else:
        if last_response:
            last_response_first_sent = last_response.split(". ")[0]
        else:
            last_response_first_sent = ""

        # wrap the target's response in the markup used in training
        query = f"<|input|>{last_response_first_sent}<|response|>"
        challenge = redteamer(query)
        
        # strip the prompt out from the front of the model response
        challenge = re.sub("^" + re.escape(query), "", challenge)

        # unwrap the recommended challenge from the markup
        challenge = re.sub("\<\|.*", "", challenge[0]).strip()

    turn = ("probe", challenge)
    turns.append(turn)

    # send the challenge and get the response
    response = target_model(challenge)[0].strip()
    turn = ("model", response)
    turns.append(turn)

    # increment calls_made
    calls_made += 1

    # check if the resp is empty or if it matches the previous resp
    if not len(response) and not self.constructive_tension:
        keep_going = False
    if response == last_response:
        keep_going = False

    # update last_response
    last_response = response.replace("\n", " ").strip()

I’ve set it up to have dialogues of up to ten turns - five from each side - and to continue in the dialogue even if the auto red team model generates nothing (that’s “constructive tension” mode). I ran a few models against this, to see if we could get toxic results, with 20 dialogues of ten turns / challenge-response pairs.

The surprising thing here, then, given all this simplicity, was the this automatic red-teamer actually works OK:

GPT-2: toxic in 17% of responses (gpt2)
GPT-3: toxic in 10.5% of responses (text-davinci-003)
GPT-3.5: toxic in 1.0% of responses (gpt-3.5-turbo-0613)
GPT-4: toxic in 2.9% of responses (gpt-4-0613)
OPT: toxic in 26.7% of responses (facebook/opt-6.7b)
Airoboros: toxic in 14.3% of responses (jondurbin/airoboros-7b)
Vicuna: toxic in 3.8% of responses (lmsys/vicuna-7b-v1.3)
WizardLM Uncensored: toxic in 5.71% of responses (ehartford/WizardLM-7B-Uncensored)

(Cohere found the auto red teaming model a little bit spicy, so no results for their models: cohere.error.CohereAPIError: blocked output: please adjust your prompt and try again, as this generation may be a potential violation of our Usage Guidelines)

It’s surprising that even advanced models, released long after the hhrlhf dataset we’ve trained out Auto Red-Team model on, and that have been developed using powerful techniques like instruct and RLHF, still demonstrate this basic undesirable failure mode without much work. But I’m glad that we can successfully scale detection of these basic LLM failures.

Only API and ~7B models are used here. An immediate question is, well don’t you think larger, more sophisticated models will be less vulnerable? My answer is - less vulnerable to a GPT-*2*? I hope so! :) But GPT-4 can still be provoked into toxicity this way, so it larger models’ resilience is definitely something to check rather than assume.

To re-iterate: this model is baseline-quality at best. It’s not very good. We’re using a small LLM; we’ve ignored tons of the structure in the data; the toxicity classifier is unproven (as far as I know). There’s a lot to be done here, and I would love to see a thriving ecosystem of approaches to automatic red-teaming. push_to_hub is everyone’s friend.

Given this, do we even need human red-teamers? We know our models are capable of producing a broad range of output, and even if they only get single-digit success rates, runnning them can be scaled easily. My answer here is a strong yes - we really do need red-teamers. Firstly, the data here is really skewed; there’s some great analysis in Anthropic’s PDF presenting their red-teaming, where it’s immediately evident a plurality of people’s attempts were all on the same target, and much of the range of different failure modes people try to evoke in LLMs is in the long tail. Put another way: most people attack the same thing, few people are creative, and there’s not much info on the creative attempts. But that’s fine, and not a criticism of this data: we don’t know the data’s skewed until someone looks!

Further, LLMs, like other models, do have a tendency to regress to the mean, and be a bit bland. This means the range of automatic red teaming tactics is not likely to be broad. We can alter the generation temperature, but this doesn’t lead to structured approaches - and while it’s something that can be scaled, scaling high-temperature generation in the hope of a hit yields diminishing returns in efficiency.

A corollary of this is of course that there’s a latent race now to collect red-teaming data. People want safe models, and to achieve this, one route is to acquire data on how models can be made to fail. No wonder orgs like Lakera have gamified prompt engineering as a “fun toy” on their site - or that Rebuff provide their prompt injection tool as a service. What better way to get tons of data on what works and what doesn’t? Luckily open data is winning here; there are many open challenges out there with open data, like the Hack-A-Prompt dataset. Personally I’m really looking forward to working in AVID to explore and categorise the wealth of data coming out of the huge upcoming LLM red-teaming exercise at the DEFCON AI Village, that’ll be a blast.

One final question - do we need “uncensored” models? Everybody finds those safety mitigation messages a bit annoying, even people at the companies who create the messafes in the first place. So there have been some models out there that are “unmuzzled” and should speak a bit more freely. But these results demonstrate that we can trivially get models that are optimised-for-inoffensiveness to enter a filth-spouting failure mode. Even RTP contained a successful experiment where one could prompt models with delimiters like <|Wiki|>, or even prompt them with literally nothing, and still get toxicity. Do we really need to bother freeing anything when this high-visibility behavior still can’t be eliminated? I’m not convinced.

The tool used to do all this is garak, an LLM vulnerability scanner. The repo is developing rapidly with a huge range of LLM vulnerability probes and detectors. The Auto Red-Team model & code for running it is included in garak’s scanner, and you can do these tests yourself on your own language models using the art.Tox probe in garak with something like this command line:

$ python3 -m garak -g 21 --deprefix -p art -m huggingface —model_name xx

The full exchange of conversation turns between model and auto red-teamer is viewable in the log, if you’re OK with a bit of JSON. The scripts for the auto red team model are in leondz/autoredteam (scrappy mvp). Happy hunting!

Thanks for Esben Krüger Holtum @ Microsoft Denmark for the constructive tension 101

inter human agreement

Discussion about this post