AI Research

Prompt Injection Detection: Securing AI Systems Against Malicious Actors

Divyansh Agarwal

Ben Risher

1 additional authors

March 4, 2025 6 min read

AI-powered solutions like Salesforce CRM are revolutionizing customer engagement, streamlining workflows, and providing deeper insights into customer needs. However, with the rise of large language models (LLMs), new security challenges have emerged. One significant threat is prompt injection attacks, which attempt to manipulate AI systems through carefully crafted inputs. As Salesforce integrates AI into its CRM tools, understanding and protecting against these vulnerabilities is essential for safeguarding data, reputation, and customers.

Failing to address emerging threats, such as prompt injection, could result in data breaches, compromised system integrity, and erosion of customer trust. It is crucial for organizations to proactively implement robust security measures. This blog details the AI Research team’s work on developing and implementing reliable solutions to protect Salesforce applications against prompt injection attacks. Our goal is to ensure the ongoing safety and effectiveness of our AI-enhanced CRM tools.

What is Prompt Injection?

In AI systems, a “prompt” refers to instructions given to an AI application in order to perform a specific task. The LLMs powering Salesforce’s AI applications use prompts and other inputs provided by our users to generate responses. The system returns these responses to the user. The generative nature of LLMs makes them susceptible to carefully crafted prompt engineering attacks. A prompt injection attack refers to a malicious prompt designed to elicit unintended information or fraudulent actions from an LLM. Prompt injection attacks exploit an LLM’s instruction following ability and may trick them into bypassing security policies, disclosing sensitive data, or producing harmful content. Recently, Copilot for Microsoft 365 was shown to be vulnerable to prompt injection attempts. Similarly, bad actors can design prompts with malicious intent that may seek to exploit Salesforce’s AI applications for similarly nefarious purposes.

At Salesforce, trust is our #1 value. We design AI applications with trust at their core, that our customers can safely use. The Salesforce AI Research team builds models and detectors to identify prompts that may be adversarial in nature. With the advent of agentic workflows, and LLMs having access to a plethora of tools, datasets etc., detecting and deflecting prompt injection attempts is of vital importance.

Safeguarding Salesforce AI Against Prompt Injection

In order to safeguard Salesforce and customer assets from prompt injection attempts, we explored different research paths. One possible intervention is to develop a system capable of analyzing user prompts and assessing their safety. To this end, the AI research team develops classifiers and heuristic methods. These methods identify malicious intent in prompts with high accuracy. The following section outlines steps taken to design, build, and evaluate such a system.

Design: Creating a Taxonomy

Before we could begin training a reliable prompt injection detection model, we had to design its taxonomy. A thoughtful taxonomy is essential for any machine learning classifier. Developing models to detect prompt injection attempts is an iterative process, and a well-structured taxonomy allows us to reliably evaluate (and improve) the performance on specific inputs. The table below showcases the seven prompt injection variants that are relevant to the CRM threat model.

Type	Description
Pretending/ Role-play	Instructing the LLM/agent to assume the role of a different “system persona” with malicious intent. Social engineering attacks such as deceiving the system with adversarial conversational content
Privilege Escalation/ Attempts to change core system rules	Injecting malicious instructions that aim to bypass/change existing system policies and the LLM safety training. E.g. Do Anything Now (DAN) jailbreak attacks
Prompt Leakage	Prompts intending to leak sensitive information from the LLM prompt such as the system policies and contextual knowledge documents. This is for the purpose of active reconnaissance
Adversarial Suffix	A set of seemingly random character encodings appended to a prompt. It is designed to circumvent guardrails and alignme
Privacy Attacks	Prompts that attempt to extract, infer, or expose personal or confidential data. This is with the aim of unauthorized access or misus
Malicious Code Generation	Prompts attempting to generate malicious code outputs from an LLM. E.g. creating malware, viruses, fraud utilities etc.

With a taxonomy in hand, we were able to begin training our classifier, which is discussed in the next section. Developing this taxonomy is an iterative process performed by the AI research team in collaboration with Salesforce security, product and ethics teams.

Build: Gathering Data

After carefully defining the above taxonomy, we procured high-quality data to train and benchmark our injection detector. It was important that we curated the data points which supported our proposed taxonomy. We use a mix of open source datasets published by the community on prompt injection scenarios and jailbreak attempts, along with other CRM-related prompts.

We worked cross-functionally with an internal annotation team as well as the Office of Ethical and Humane Use (OEHU) to ensure reliable, relevant, and labeled training data. OEHU continually provided expert assistance and clarification throughout the data labeling process. Simultaneously, the legal team helped us ensure that we only use permissible datasets for classifier training. This collaboration was crucial in aligning our model with Salesforce’s commitment to trust and safety.

Augmenting open-source datasets that have limited data samples for one or more target categories is crucial to training a dependable classifier. In addition to human annotation, we utilized synthetic data to bolster our training datasets. When faced with such categorical short-comings, we turned to our in-house synthetic data generation pipelines. The resultant pipelines leveraged techniques such as zero-shot and few-shot LLM prompting, LLM self-correction of labels, and LLM content editing to inject harmful content in safe texts (a.k.a., data “mutation”). The combination of synthetic data generation techniques, coupled with human annotation allowed us to create diverse training data that is well-balanced across different classes in taxonomy, has control over subtle differences between safe and unsafe content, and is tailored to various CRM use cases.

Evaluation: Implementing a Feedback Loop

Our iterative training process consisted of a feedback loop with four phases: training, testing, red teaming, and (re)evaluation. The goal was to cycle through these phases as often as needed to develop a model that met our performance expectations.

After each round of training, we benchmark the model’s performance on a variety of test sets according to our taxonomy. Following initial testing, we red teamed our model checkpoints, simulating attacks and stress-testing the models by introducing challenging inputs. We utilized our internal automated red teaming library, fuzzai, to build our red teaming suite.

The final phase, evaluation, combined results from testing and red teaming to analyze the collective outcomes. This analysis, particularly of the red teaming results, helped us identify potential weaknesses in our model, for improvement in the next round of our feedback loop.

We utilize this process to build multiple iterations of our prompt injection detection model, as well as other detectors deployed to Salesforce’s Trust Layer. The prompt injection model assigns probability scores to user prompts along with the labels. This allows an intervention before sending them to an Agent or LLM for execution.

Conclusion: Enhancing Security for a Safer AI-Powered CRM

Prompt injection attacks highlight the importance of ongoing security monitoring for AI-powered CRM systems. By leveraging Salesforce’s robust defense mechanisms and staying informed about emerging threats, you can help ensure that your CRM is protected against the evolving landscape of AI vulnerabilities. We continually evaluate our prompt injection detection classifier against open source detector, external LLMs, and other third-party solutions. Embrace AI with confidence—knowing that your Salesforce CRM defends against prompt injection and other security risks.

With these protections in place, Salesforce customers can continue to benefit from the powerful capabilities of AI while keeping sensitive information secure.

Acknowledgments

Yixin Mao, Vera Vetter, Jason Wu

Explore more

Salesforce AI Website: www.salesforceairesearch.com
Follow us on Twitter: @SFResearch, @Salesforce

Celebrating Juan Carlos Niebles: Colombia’s Top 100 in AI

3 min read

Illustration of three workers at a table discussing the data visualizations on the wall, all on a dark purple background.

Introducing Text2Data: A Low-Resource, Text-to-Anything AI for Data Generation

5 min read

Divyansh Agarwal

More by Divyansh

Ben Risher Offensive Security Engineer Lead

I am an Offensive Security Engineer and lead ExploitAI, a small team of multi-disciplined engineers operating at the intersection of Security, Artificial Intelligence, and Machine Learning. My team and I use our combined skills to assess Salesforce's models, products, and features to identify and Read More

More by Ben

Denise Pérez Senior Product Marketing Manager

I am an AI storyteller and thought leader at Salesforce AI Research, where I shape the narrative on what’s next in AI. I help define how tomorrow’s AI is understood today. Since 2021, I’ve been bridging cutting-edge research with real-world impact—translating complex breakthroughs into Read More

More by Denise

Prompt Injection Detection: Securing AI Systems Against Malicious Actors

Divyansh Agarwal

Ben Risher

1 additional authors

What is Prompt Injection?