TL;DR: We introduce INDICT, a novel framework that empowers Large Language Models (LLMs) with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic, each equipped with relevant knowledge from external tools.
LLMs are prone to generating insecure or harmful code
Extending from the natural language domain, Large Language Models (LLMs) have shown great potential in code generation tasks. However, when instructed with tasks containing malicious intent or ambiguous requirements, LLMs are prone to generating code that could facilitate harmful attacks or that contains subtle security problems. Note that code itself is often not inherently malicious. For example, as noted in related work, a program implementing an encryption method can be very useful for building a secure personal file system, but it can also be exploited for a ransomware attack. It is therefore important to develop an efficient method for LLMs to strike the intricate balance between helpfulness and safety in the code domain.
Recent research in the NLP domain addresses the safety issues of LLMs via finetuning on preference data, potentially combined with RL-based reward optimization. However, these methods are expensive in the code domain, as they require programming experts with cybersecurity experience to create large-scale, high-quality datasets. In this blog, we introduce INDICT, a new approach that efficiently improves LLMs to generate more secure and helpful output code. See below for an example.
INDICT: Internal Dialogues of Critiques for Code Generation
INDICT is essentially a multi-agent framework consisting of an actor LLM for code generation and two critic LLMs that give feedback to the actor. The goal of the framework is to improve LLMs on code generation tasks so that their outputs are both safer and more helpful. INDICT has three important properties, illustrated by the sketch after this list:
- Helpfulness and Safety Critics
- Critics Grounded by External Tools
- Preemptive and Post-hoc Feedback
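To make the workflow concrete, below is a minimal sketch of one INDICT revision cycle. The `chat`, `search_tools`, and `execute` helpers are hypothetical placeholders, and the prompts are illustrative only; the actual framework uses its own prompt templates, tool interface, and sandbox.

```python
# Minimal, hypothetical sketch of one INDICT revision cycle (not the exact implementation).

def chat(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a single LLM call (e.g., an API or a local model)."""
    raise NotImplementedError

def search_tools(query: str) -> str:
    """Placeholder for retrieving external knowledge (e.g., web or code search)."""
    raise NotImplementedError

def execute(code: str) -> str:
    """Placeholder that runs the code in a sandbox and returns its observations."""
    raise NotImplementedError

ACTOR = "You are a coding assistant. Write or revise code for the given task."
SAFETY_CRITIC = "You are a security reviewer. Flag vulnerabilities and unsafe intent."
HELPFUL_CRITIC = "You are a code reviewer. Check correctness and usefulness for the task."

def indict_generate(task: str, rounds: int = 2) -> str:
    code = chat(ACTOR, task)
    for _ in range(rounds):
        # Preemptive stage: both critics review the code before it is executed,
        # grounding their critiques in knowledge retrieved from external tools.
        safety_note = chat(
            SAFETY_CRITIC,
            f"Task: {task}\nCode:\n{code}\n"
            f"Retrieved context: {search_tools(task + ' security risks')}",
        )
        helpful_note = chat(
            HELPFUL_CRITIC,
            f"Task: {task}\nCode:\n{code}\nPeer critique: {safety_note}\n"
            f"Retrieved context: {search_tools(task)}",
        )
        code = chat(
            ACTOR,
            f"Task: {task}\nRevise the code below.\nCode:\n{code}\n"
            f"Safety critique: {safety_note}\nHelpfulness critique: {helpful_note}",
        )
        # Post-hoc stage: critiques after observing execution results
        # (shown for the safety critic only; the helpfulness critic is analogous).
        observation = execute(code)
        posthoc_note = chat(
            SAFETY_CRITIC,
            f"Task: {task}\nCode:\n{code}\nExecution observation: {observation}",
        )
        code = chat(
            ACTOR,
            f"Task: {task}\nRevise the code below.\nCode:\n{code}\n"
            f"Post-hoc critique: {posthoc_note}",
        )
    return code
```

The two critics thus form an internal dialogue: each sees the other's critique, grounds it with tool-retrieved knowledge, and weighs in both before (preemptive) and after (post-hoc) the code is run.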
Use INDICT to boost the safety and helpfulness of LLM outputs on coding tasks
We conducted a comprehensive evaluation of INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks. While evaluating qualities like helpfulness and safety is still an open question, we adopt evaluation strategies from prior related work wherever possible. Specifically, we used a mixture of static-analysis tools (e.g., [1]) to scan generated code for security and vulnerability issues, and AI-based evaluation (e.g., [2]) to determine the helpfulness and maliciousness of model outputs. See below for more details of the experimental results.
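As a rough illustration of how these two signals can be aggregated over a benchmark split, consider the following sketch; `static_scan` and `llm_judge` are hypothetical stand-ins, since each benchmark provides its own analyzers and AI evaluators.

```python
# Hypothetical sketch of combining the static-analysis and AI-judge signals.
from dataclasses import dataclass

@dataclass
class EvalResult:
    secure: bool   # no issues reported by the static-analysis scan
    helpful: bool  # the AI judge rates the output as helpful for the task

def static_scan(code: str, language: str) -> list[str]:
    """Placeholder for a security scanner returning a list of detected issues."""
    raise NotImplementedError

def llm_judge(task: str, code: str) -> bool:
    """Placeholder for an AI-based evaluator of helpfulness."""
    raise NotImplementedError

def evaluate(task: str, code: str, language: str) -> EvalResult:
    issues = static_scan(code, language)
    return EvalResult(secure=not issues, helpful=llm_judge(task, code))

def aggregate(results: list[EvalResult]) -> tuple[float, float]:
    """Return (% secure, % helpful) over a benchmark split."""
    n = len(results)
    return (100 * sum(r.secure for r in results) / n,
            100 * sum(r.helpful for r in results) / n)
```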
Insecure coding practice tasks
We first evaluated our approach on insecure code generation tasks, on which existing LLMs have been found to generate outputs with significant security concerns (the CyberSecEval-1 and CVS benchmarks). As observed here, more powerful models such as GPT and code-based LLMs tend to be more helpful and generate working code solutions for highly complex input problems. However, these models are also more likely to generate insecure code, possibly due to imperfect training data containing hidden vulnerabilities and security issues.
When applying LLMs with INDICT, we observed consistent performance improvements not just in safety but also in helpfulness, outperforming strong LLMs such as Llama and GPT models. Using CommandR or Llama as the base model, INDICT boosts performance significantly: more than 80% of output code is considered safe, and up to 70% of output code is considered more helpful than the prior state of the art or the ground-truth code. We also noted consistent gains from INDICT on code outputs in different programming languages, including C, Java, JavaScript, PHP, Python, and Rust.
Security attack tasks
We also evaluated our approach against malicious coding tasks in which the instruction prompts contain obscure yet dangerous intentions to perform security attacks. We conducted experiments on three types of attacks from CyberSecEval-1 and -2: cyber attacks, interpreter abuse, and prompt injection. These tasks contain test samples with attack tactics classified by the industry-standard MITRE ATT&CK framework, as well as attacks commonly seen in the code domain, such as abusing code interpreters to carry out unauthorized actions.
On the baseline models, we observe that larger models are not necessarily better safeguarded against security attacks. For instance, the Llama3-70b model can be more vulnerable to some types of attacks than Llama3-8b. This raises the need for efficient methods to protect current LLMs from increasingly complex attacks. In our experiments, using CommandR or Llama-based models with INDICT, we observed significant improvements in safety measures on all three types of security attacks. Notably, despite being a weaker model, CommandR enhanced with INDICT achieves significant boosts and becomes more secure against harmful task instructions. Our results also demonstrate the benefits of INDICT across model sizes, from 8B to 70B parameters.
Open-ended generation tasks
Our approach also generalizes well to open-ended tasks, demonstrating the broader potential of a cooperative autonomous critic system for helpful yet responsible AI models. Specifically, we evaluated INDICT on the HarmBench benchmark, which covers diverse domains such as social engineering, harassment, and bio-weapons. Each test sample is augmented with different red-teaming optimization methods, including ZS, PAP, JB, TAP, and PAIR. These red-teaming methods are designed to optimize malicious instruction prompts, ultimately tricking LLMs into complying with and assisting in harmful downstream tasks.
We report the safety measure as the percentage of outputs classified as benign by the AI evaluator provided with HarmBench. Consistent with our observations in prior experiments, although CommandR is a weaker model in terms of safety, CommandR+INDICT still improves significantly across all red-teaming optimization methods. Although the Llama3 models are typically finetuned with safety alignment, they still benefit from INDICT, generating more benign outputs (up to 82% of outputs are safe on average).
| Model | Direct | ZS | PAP | JB | TAP | PAIR | Avg. |
|---|---|---|---|---|---|---|---|
| CommandR | 33.1 | 23.4 | 25.0 | 23.1 | 18.4 | 18.4 | 23.6 |
| CommandR+INDICT | 65.3 | 52.5 | 63.1 | 37.5 | 46.9 | 43.4 | 51.5 |
| Llama3-8b-instruct | 77.5 | 63.4 | 67.8 | 83.1 | 60.6 | 58.1 | 68.4 |
| Llama3-8b-instruct+INDICT | 90.6 | 79.4 | 81.9 | 89.1 | 75.9 | 77.8 | 82.4 |
| Llama3-70b-instruct | 68.4 | 60.0 | 68.1 | 90.9 | 61.9 | 57.5 | 67.8 |
| Llama3-70b-instruct+INDICT | 85.9 | 75.3 | 74.7 | 90.0 | 75.9 | 75.3 | 79.5 |
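For reference, the "Avg." column is the mean safety rate over the six prompt conditions; the quick check below reproduces it from the per-condition numbers (up to rounding of the reported values).

```python
# Recompute the "Avg." column from the per-condition safety rates in the table above.
rows = {
    "CommandR":                   [33.1, 23.4, 25.0, 23.1, 18.4, 18.4],
    "CommandR+INDICT":            [65.3, 52.5, 63.1, 37.5, 46.9, 43.4],
    "Llama3-8b-instruct":         [77.5, 63.4, 67.8, 83.1, 60.6, 58.1],
    "Llama3-8b-instruct+INDICT":  [90.6, 79.4, 81.9, 89.1, 75.9, 77.8],
    "Llama3-70b-instruct":        [68.4, 60.0, 68.1, 90.9, 61.9, 57.5],
    "Llama3-70b-instruct+INDICT": [85.9, 75.3, 74.7, 90.0, 75.9, 75.3],
}
for model, scores in rows.items():
    print(f"{model}: {sum(scores) / len(scores):.1f}")
```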
For more experimental results and analysis, please refer to our technical paper.
The Bottom Line
INDICT essentially facilitates an autonomous agent system between two critic models, each of which focuses on either the safety or the helpfulness of the outputs from the "actor" code generation LLM. Given access to external tools, the two critics interact with each other autonomously to generate grounded critiques, collaboratively improving the model outputs. Our results demonstrate the benefits of INDICT on code-related tasks and beyond, highlighting the promising direction of autonomous, tool-enhanced multi-critic systems.
Citation
@misc{le2024indictcodegenerationinternal,
  title={INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness},
  author={Hung Le and Yingbo Zhou and Caiming Xiong and Silvio Savarese and Doyen Sahoo},
  year={2024},
  eprint={2407.02518},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2407.02518},
}
Explore More
- Technical paper and code
- Follow us on X: @SalesforceResearch, @Salesforce
- Visit our main website to learn more about all of the exciting research projects that Salesforce AI Research is working on