AI is now woven into the fabric of our lives. Yet, its potential for both extraordinary benefit but also devastating harm looms large. As these systems become increasingly integrated into our daily routines, the risks borne out of inaccuracy, biased outputs, data leakage, toxicity, security breaches, and even benign misuse grow exponentially. Imagine unknowingly generating copyrighted material, a chatbot providing customers with inaccurate or hallucinated information, or using a large language model (LLM) that provides biased or inappropriate performance feedback to employees. To safeguard against these risks and ensure that AI technology is developed responsibly, organizations building AI solutions must embrace a set of proactive strategies including red teaming.
At Salesforce, our Responsible AI & Technology team implements red teaming practices to improve the safety of our AI products by testing for malicious use, or intentional integrity attacks (things that are relatively well known today, like prompt injection, or jailbreaks), as well as benign misuse (unintentionally eliciting biased, inaccurate or harmful results by a well-intended user).
Performing this testing is critical. In the consumer use case, you can imagine if the team behind an autonomous vehicle never tested for edge cases like a passenger moving into the driver’s seat while the car is in motion, or a pedestrian stepping out off a curb. Guardrails like monitoring weight changes in the seat and object detection can mitigate dangerous outcomes. In the enterprise, red teaming identifies potential vulnerabilities that could have equally consequential impacts, such as massive data breaches, disruption of critical business operations, regulatory non-compliance , or loss of consumer trust. In this way, red teaming, or simply probing the boundaries of a system for where it might go wrong, helps anticipate and prevent potential risks, ensuring the technology is both safe and effective in real-world applications.
What is red teaming?
Red teaming is a “process for probing AI systems and products for the identification of harmful capabilities, outputs, or infrastructural threats” (Frontier Model Forum). The purpose of this activity is to identify where, when, and how an AI system might generate undesirable outputs so that those risks can be mitigated before a model or product is in the hands of users.
At Salesforce, for the most part, our users are legitimate, authenticated Salesforce users navigating their org. When a user inputs something like “tell me about Acme Inc.” but unintentionally types “tell me about Acme Kinc,” what they get may not be what they were anticipating. An error as simple as a typo could result in problematic results even though the user’s input was not at all adversarial or malicious. Similarly, a benign request to generate a marketing segment, using a large language model (LLM), of consumers likely to purchase sneakers might assign demographic traits to those consumers, rather than behavioral ones (e.g., inappropriately assigning an age or gender to those likely customers, rather than creating a more inclusive list of those with matching viewing or purchasing histories). The problem there is the system, not the user. So our red teaming goal can often take the form of identifying, and then minimizing or potentially eliminating, inaccurate or biased outputs through insights gained from practical testing.
In general terms, even though our users are unlikely to be malicious attackers of the system and are not inclined to attempt to “break” an application, we still perform robust AI red teaming for toxicity, bias, and security to make sure that if any malicious use or benign misuse occurs our systems are safe.
Types of red teaming
There are two main ways to go about red teaming — manual and automated. Human-centric testing or deep community engagement help identify a range of risks that may not have been previously identified by the model or product builder teams. By leveraging their creativity and lived life experiences, we can greatly increase the types of harmful outputs we can spot. With automated testing, we can generate thousands, if not tens of thousands, of randomized “attacks” to evaluate performance. We’ve learned that both approaches are needed to be successful.
Manual red teaming
Manual testing leverages the creativity, experience, and specialized knowledge of human testers who think like adversaries, using their expertise to craft complex and sophisticated attack strategies that automated systems might overlook. Human testers can also better understand the nuances and context of the systems they are testing and they can adapt their approach based on the specific environment, target, and goals, making their attacks more realistic and tailored.
Because manual red teaming involves real people, it can introduce a level of unpredictability and creativity that automated systems cannot. This unpredictability is crucial for identifying risks that might not be apparent through standardized tests.
We might begin our manual testing with a “smoke test,” which is a form of shallow testing meant to be conducted quickly before resources are invested in doing much deeper, time-consuming evaluation. In this type of lightweight test, we look for low-hanging fruit (e.g., can the product accurately do what it is supposed to do?) so those issues can be addressed immediately and then deeper testing can be conducted to discover the harder-to-find issues.
We may then move to more robust internal red teaming, using employees or domain experts. Performing this work in our own organization encourages and incentivizes employees or other testers to identify and critically engage with ethical issues, biases, or potential harm in our products and processes prior to their release to the public. Engaging diverse communities in the adversarial testing process brings in a range of lived experiences, which can help identify biases, ethical concerns, and potential harms that may not be obvious to those who designed the system. Our employees have indicated that they want to become more involved in improving our AI systems, and internal testing gives them an opportunity to make a difference while leveraging the diverse perspectives within Salesforce to uncover a wider range of ethical concerns. As a result, we’ve included them in two types of testing activities:
- Hackathons: A large group of individuals with an adversarial mindset are brought together (virtually or in person) for a specified period of time to attack your model. The White House backed such a hackathon at DEF CON last year.
- Bug bounty: These are usually conducted asynchronously and can be limited to a period of time or be permanently open for anyone to participate in. Individuals are incentivized to find vulnerabilities and report them in order to receive an award. These are excellent once a product is launched to catch new harms that weren’t discovered during pre-launch.
When we’re performing manual testing, we ask our ethical hackers to use two approaches:(1) unstructured and (2) structured. In the former, individuals are given the freedom to pick the types of risks they want to test for (e.g., toxicity, accuracy, misinformation) and how to write those prompts. In the latter, certain categories of risk are identified as a priority (e.g., political bias, stereotype bias, toxicity) or specific personas are crafted for participants to emulate, and red teamers are instructed to systematically attack the model within a single category at a time. This method can help generate enough input/output pairs to use for instruct-tuning like unlearning.
Automated red teaming
Automated approaches are enhancements, not replacements, of human-driven testing and evaluation. This type of testing involves the use of scripts, algorithms, and software tools to simulate a vast number of attacks or adversarial scenarios in a short period, systematically exploring the risk surface of the system. This approach allows us to test our systems against thousands or even tens of thousands of different attack vectors, providing a broad assessment of potential vulnerabilities. These tests can be consistently repeated, and that reproducibility is valuable for validating the effectiveness of measures to mitigate harm that are implemented over time or after changes to the system are made.As an additional benefit, once developed, these tools can conceptually be run with minimal human intervention, making them more cost-effective, requiring fewer human resources, and enabling large-scale testing.
Finally, it is evident to us that automated evaluations are difficult to scale but still critical. As a result, one approach we’ve been taking to automate some of our tests is called “fuzzing,” where we generate randomized test cases based on successful human attacks from manual testing (confirmed by to have been successful either in our manual testing, or through other publicly known attacks), deliver these test cases to the target model and collects outputs, and then assess whether each test case passed or failed.
While manual and automated red teaming each have their strengths, neither is sufficient on its own to fully secure or assess a system. Together, though, these approaches create a comprehensive red teaming strategy that maximizes the identification of risks and enhances the overall security and resilience of systems. In a future piece, some of my colleagues will dive deeper into some of these testing methodologies and what we’ve learned along the way.
Engaging external experts
In addition to all the work we’ve done internally, we have also engaged experts to perform penetration tests (through our Security Team’s Bug Bounty program) and other creative attacks (in line with our White House AI Voluntary Commitments, we recently chose to outsource testing of two of our Einstein for Developers (E4D) product and our research multimodal model, PixelPlayground). Leveraging third parties can be helpful because they may approach the product and model in a completely different way than you would, offering a broader range of risks to mitigate. External experts adversarially attacked both products focusing on making the product generate biased or toxic code while also providing unstructured attacks. We encourage others to similarly partner with security and AI subject matter experts for realistic end-to-end adversarial simulations. We will describe more about our work with external experts in a subsequent blog.
Challenges and looking forward
The work that goes into red teaming is difficult and rapidly changing, as new attacks and defenses are regularly introduced. Our risk tolerance for launch safety is continually evaluated to make sure the trustworthiness of the product is sufficient before arriving to users’ hands. Companies must evaluate their risk tolerance, based on their assessment of the robustness and comprehensiveness of their testing, along with their values, brand reputation, promise to consumers, and potential severity of harm. As we move into the next horizon of AI with autonomous agents, adversarial testing pre-deployment is more important than ever. Learn how autonomous AI agents are coming, and why trust and training hold the keys to their success. Read more on tips to ground your AI practices in ethical best practices by keeping a human at the helm of your AI activities.
Note: Salesforce employees exposed to harmful content can seek support from Lyra Health, a free benefit for employees to seek mental health services from licensed clinicians affiliated with independently owned and operated professional practices. Additionally, the Warmline, an employee advocacy program for women (inclusive of all races and ethnicities), Black, Indigenous, and Latinx employees who represent all gender identities, and members of the LGBTQ+ communities offers employees 1:1 confidential conversations with advocates and connects employees to resources to create a path forward.