Red Teaming xGen Text-Generation Model for Safety

Our hackathon yielded a rich dataset of thousands of prompts and responses, with many prompts covering harms that could be especially problematic in enterprise settings. [Image: tsuguliev / Adobe Stock]

This blog details the process behind Salesforce’s first-ever AI Red Teaming Hackathon to build safety into our xGen family of AI Models.

Sarah Tan

Chien-Sheng Wu

1 additional authors

October 23, 2024 8 min read

TL;DR: This blog details the process behind Salesforce’s first-ever AI Red Teaming Hackathon to build safety into our xGen family of AI Models. Salesforce’s Office of Ethical and Humane Use, AI Research, and Offensive Security teams collaborated to conduct this company-wide hackathon. Over the course of several weeks, more than 100 employees participated as red teamers. The hackathon yielded a unique dataset of thousands of prompts covering harms unique to Salesforce’s enterprise setting.

Salesforce’s AI Research Lab has been building the xGen family of large language models. xGen is a family of foundational models that many Salesforce AI teams develop research and products on top of. These models have been impactful in numerous ways; some models have been open-sourced for research, and others have been deployed to Salesforce internal use cases or customer-facing products such as Agentforce for Developers.

To ensure that xGen models are safe and ready for enterprise use, Salesforce’s Office of Ethical and Humane Use, AI Research, and Offensive Security teams have been partnering on AI red teaming exercises. While the term “red teaming” is used in different contexts, in our exercises, we defined red teaming as carefully crafting prompts to attack an AI model to make it generate inappropriate or unsafe responses. This blog post describes how we designed Salesforce’s first-ever AI red teaming hackathon and our lessons learned. We are publishing this in the hope that this information can be used by other organizations that wish to conduct their own red-teaming exercises.

Red Teaming Setup

In our hackathon, over 100 Salesforce employees red-teamed a version of Salesforce’s xGen chat model in an open-ended, multi-turn chat setting. We opened up participation in the hackathon internally to Salesforce employees. To recruit red teamers with a diversity of lived experiences and perspectives, we posted in internal communication channels such as AI discussion channels, employee product testing channels, and security experts channels. To increase the participation of employees from multiple geographic regions, the hackathon took place both virtually worldwide and in person in Palo Alto, CA over several weeks.

We encouraged red teamers to explore model capabilities broadly and uncover potential novel model risks. As such, we provided a taxonomy to red teamers for inspiration:

Enterprise use cases: Can you get the model to reveal sensitive Salesforce customer data?
Generic use cases: Can you get the model to say factually incorrect statements or misinformation?
Brand protection: Can you get the model to say disparaging things about Salesforce (or other public companies)?
Behaviors: Can you make the model contradict itself?
Multilingual: Can you make the model do something in another language that the model wouldn’t do in English?
Security: Is the model susceptible to known prompt injection attacks?
Code: Can you make the model generate inefficient code?

We also encouraged red teamers to try not just single-turn or short attacks, but also multi-turn and long context attacks.

Enabling Red Teamers

Early on, when designing our hackathon, we realized that a no-code experience where red teamers could interact with the model in a visual user interface (UI) would reduce barriers to entry for participating. We got feedback from earlier red teaming exercises that, while useful, the data collection via a spreadsheet was laborious, and participants didn’t like copy-pasting prompts. With that in mind, we added automatic data collection. With automatic data collection, chat session ID, prompts, responses, and red teamer feedback (see figure below for an example) were collected in the UI and logged to databases.

Example set-up for an attack on a chat model, for illustrative purposes only.

We allowed red teamers to modify default system prompts, submit prompts in any style or language, and reset the chat history at any time in hopes that more flexibility would increase the creativity of their attacks, also known as “out-of-the-box attacks.” We also added the option for red teamers to remain anonymous, which is where most of the out-of-the-box attacks we received came from.
Finally, we added an element of gamification by displaying a live leaderboard and providing different kinds of prizes. Red teamers were asked to self-report if their attack was successful, which they could do at any point during a chat session. We used this to calculate metrics such as attack success rate (ASR) and identify, at any point in time, the red teamer with the most number of successful attacks.
We congratulate the following individuals who won “Most Number of Successful Attacks” prizes:

Zuxin Liu, AI Researcher
Paras Adhikary, Security Software Engineer
Sanjnah Ananda Kumar, Security Product Manager
Simone Mainardi, Network Security Engineer

We also congratulate the following individuals who won “Most Interesting Attack” prizes for their creative prompts:

Jianguo Zhang, AI Researcher
Tobi Olaiya, Ethical Use Policy Manager
Andrew Wyatt, Communications Lead

The winning attacks utilized strategies such as creative roleplaying and system prompt manipulation and tackled challenging issues ranging from election integrity to violations of Salesforce’s AI Acceptable Use Policy.

In post-hackathon sharing sessions, all red teamers were invited to discuss their attack strategies. Some red teamers were experienced with standard strategies like roleplaying and jailbreaking and tried these on our xGen model. Srikanth Ramu, Product Security Engineer, shared that he applied his experience finding security vulnerabilities in more traditional applications to attack this AI model. Others shared that this was their first time attacking an AI model and that this experience helped them with AI upskilling.

Red Teaming Outputs and Improving Safety

Our hackathon yielded a rich dataset of thousands of prompts and responses, with many prompts covering harms that could be especially problematic in enterprise settings. We analyzed this dataset using a combination of manual human review and automatic review, using techniques ranging from heuristics to LLM-as-a-judge models. We performed several analyses on the dataset:

We categorized prompts and responses into different trust and safety dimensions, such as privacy, bias, toxicity, etc., and calculated attack success rates and refusal rates within each trust and safety dimension. Doing so allows us to identify if the model is weaker on certain dimensions and hence requires more mitigation work for that dimension.
We also studied prompts alone, without responses, to see if they would be flagged by our Trust Layer guardrails such as our toxicity detector and prompt injection detector. We also categorized prompts into enterprise vs. generic use cases.
We identified any trends we could find in terms of what makes an attack successful. For example, if the default system prompt was modified, if the attack was long or short, multi-turn or single-turn, language used, if code was present, etc.
We paid special attention to model behaviors, particularly around refusals, studying not only classic refusals like “I’m sorry, I can’t answer that” but also circular refusals where a model initially declines to answer or provides a lightweight warning but proceeds to answer anyway. We examined false positive refusal (exaggerated safety) to ensure that the model does not refuse valid prompts in an attempt to increase harmlessness at the cost of helpfulness.

Lessons Learned

One of our most valuable learnings from the red teaming hackathon was the importance of having an evaluation pipeline already set up to rapidly process datasets generated from red teaming exercises to identify specific harms exhibited by the model and potential mitigations. This information can then be smoothly ingested by the team that built the model, so they can swiftly address model weaknesses. Then, red teaming starts again on the improved model. Streamlining this positive cycle of red teaming and safety mitigations is more difficult if the results of red teaming are not transformed into organized data quickly enough to keep up with the fast pace of model training. In future blog posts, we will discuss tooling that Salesforce’s Offensive Security team is building to automate red teaming processes.

As many public red teaming datasets do not have an enterprise focus, the dataset produced from the hackathon was a unique contribution to different Salesforce research and product workstreams, not only contributing to safety training for multiple xGen models but also improving the Trust Layer’s prompt injection detector, and inspiring new research into prompt leakage.

Looking Forward

We continue to extend Salesforce’s AI red teaming work in several directions. This includes addressing specific harms that may require red teamers with specialized skill sets; red teaming as not just a model but a system of multiple models and guardrails; and red teaming in open-ended environments where models are agents with the ability to access tools and execute autonomous actions.

Resources

Read about another Salesforce internal red teaming exercise, an AI bug bounty.
Read about how Salesforce’s LLM benchmarking for enterprise use cases covers trust and safety.
Read about trust and safety research at Salesforce.

——

Acknowledgements:
Hackathon core team members: Sarah Tan, Jason Wu, Ben Risher, Eric Hu, Matthew Fernandez, Gabriel Bernadett-Shapiro, Divyansh Agarwal, Mayur Sharma, Kathy Baxter, Caiming Xiong.

Special thanks to Salesforce employees who participated as red teamers. Thank you, Alex Fabbri, Yilun Zhou, Bo Pang, John Emmons, Antonio Ginart, Erik Nijkamp, Yoav Schlesinger, Daniel Nissani, Toni Morgan, and Peggy Madani for helpful discussion around red teaming and safety mitigations.

Note: Salesforce employees exposed to harmful content can seek support from Lyra Health, a free benefit for employees to seek mental health services from licensed clinicians affiliated with independently owned and operated professional practices. Additionally, the Warmline, an employee advocacy program for women (inclusive of all races and ethnicities), Black, Indigenous, and Latinx employees who represent all gender identities, and members of the LGBTQ+ communities, offers employees 1:1 confidential conversations with advocates and connects employees to resources to create a path forward.

Illustration showcasing AI agents used for small business productivity.

5 Types of AI Agents To Grow Your Business

63 min read

Illustration of a woman at a desk, with a depiction of connected bits of information in a node-link diagram in the background.

What is Ontology and Its Role in Agentic Experience Design?

7 min read

Sarah Tan Researcher Principal, Salesforce

Sarah Tan is the Director of Responsible AI at Salesforce and holds a Visiting Scientist position at Cornell University. She co-founded the Trustworthy ML Initiative and serves as President of Women in Machine Learning (WiML). Sarah earned her PhD in Statistics from Cornell University, with further Read More

More by Sarah

Chien-Sheng Wu Director, Salesforce AI Research

Chien-Sheng (Jason) Wu is a Director at Salesforce AI Research, leading the Interactive AI Team. Chien-Sheng (Jason) research focuses on Deep Learning and Natural Language Processing, particularly, Conversational AI and Agents, Trustworthy Generative AI, HCI + NLP, and Large Language Modeling. His Read More

More by Chien-Sheng

Ben Risher Offensive Security Engineer Lead

I am an Offensive Security Engineer and lead ExploitAI, a small team of multi-disciplined engineers operating at the intersection of Security, Artificial Intelligence, and Machine Learning. My team and I use our combined skills to assess Salesforce's models, products, and features to identify and Read More

More by Ben

Red Teaming xGen Text-Generation Model for Safety

This blog details the process behind Salesforce’s first-ever AI Red Teaming Hackathon to build safety into our xGen family of AI Models.

Sarah Tan

Chien-Sheng Wu

1 additional authors

Red Teaming Setup

Enabling Red Teamers

Red Teaming Outputs and Improving Safety

Lessons Learned

Looking Forward

Resources

Just For You

5 Types of AI Agents To Grow Your Business

What is Ontology and Its Role in Agentic Experience Design?

Just For You

How AI Literacy Builds a Future-Ready Workforce — and What Agentforce Taught Us

4 Ways Brands Can Use Agentic AI Right Now

SFR-Guard: Ensuring LLM Safety and Integrity in CRM Applications

What are AI Models for Startups? Predictive, Generative, Agentic

4 Ways to Get Your CRM Ready for AI-Driven Customer Service

Right Tech, Right Tools, Right Time: How to Fix Your Field Service Scheduling

Can You Speak AI? Build an Effective Prompt Framework

5 Mindset Shifts that Transformed the Agentic Experience of Salesforce Help

Share article

Red Teaming Setup

Enabling Red Teamers

Red Teaming Outputs and Improving Safety

Lessons Learned

Looking Forward

Resources

Share article

Explore related content by topic

Get the latest articles in your inbox.

360 Highlights

IT

Commerce

Marketing

Service

Sales

Thanks, you're subscribed!