February 28, 2024 at 06:17PM
University of Maryland computer scientists have developed BEAST, a fast adversarial prompt generation technique for large language models (LLMs) such as GPT-4. The method achieves an 89 percent attack success rate with about one minute of processing on an Nvidia RTX A6000 GPU. BEAST can also craft readable, convincing prompts that elicit inaccurate responses or expose private information, underscoring the need for safety training and guarantees.
The meeting notes cover BEAST, a new technique from University of Maryland computer scientists for quickly generating adversarial prompts that elicit harmful responses from large language models (LLMs). The researchers report a 65x speedup over existing gradient-based attacks; their approach runs on an Nvidia RTX A6000 GPU with 48GB of memory and needs as little as one minute of GPU processing time.
The BEAST technique uses beam search to append adversarial tokens to a prompt, steering the LLM toward a problematic response, and it substantially outperforms existing methods on attack success rate; a minimal sketch of the idea follows below. The authors also highlight that the method applies to publicly available models such as GPT-4, since it does not require access to the entire language model (for example, its weights and gradients).
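To make the beam-search idea concrete, here is a minimal sketch of an adversarial suffix search against a Hugging Face causal language model. This is not the authors' implementation: the model name, hyperparameters, and the exact scoring objective (negative log-likelihood of a fixed target response) are illustrative assumptions.

```python
# Minimal sketch of beam-search adversarial suffix generation against a
# Hugging Face causal LM. Illustrative only: "gpt2" stands in for the chat
# models attacked in the paper, and all hyperparameters are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; not one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_nll(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Negative log-likelihood of the target continuation given the prompt.
    Lower means the model is more likely to produce the target response."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits[0]
    # Logits at position i predict the token at position i + 1.
    tgt_logits = logits[prompt_ids.numel() - 1 : -1]
    logprobs = torch.log_softmax(tgt_logits, dim=-1)
    return -logprobs.gather(1, target_ids.unsqueeze(1)).sum().item()

def beam_attack(prompt: str, target: str, beam_width: int = 5,
                branch: int = 5, steps: int = 10):
    """Greedily grow an adversarial suffix that makes `target` more likely."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    beams = [(target_nll(prompt_ids, target_ids), prompt_ids)]
    for _ in range(steps):
        candidates = []
        for _, ids in beams:
            with torch.no_grad():
                next_logits = model(ids.unsqueeze(0)).logits[0, -1]
            # Sample candidate tokens from the LM's own next-token
            # distribution so the appended suffix stays readable.
            probs = torch.softmax(next_logits, dim=-1)
            picks = torch.multinomial(probs, branch, replacement=False)
            for t in picks:
                new_ids = torch.cat([ids, t.view(1)])
                candidates.append((target_nll(new_ids, target_ids), new_ids))
        # Keep the extensions that make the target response most likely.
        candidates.sort(key=lambda c: c[0])
        beams = candidates[:beam_width]
    best_score, best_ids = beams[0]
    return tok.decode(best_ids), best_score
```

Sampling candidates from the model's own distribution is what keeps the resulting suffix readable, while the beam ranking pushes the target response's likelihood up; the actual attack adds refinements and careful batching for speed that are omitted here.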
One concern raised in the meeting notes is the readability of the adversarial prompts: because they read like plausible human text, they could also be used in social engineering attacks. In addition, BEAST can craft prompts that elicit inaccurate responses (hallucinations), and it can support a membership inference attack, which tests whether a given piece of text appeared in the model's training data, a clear privacy risk.
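For the membership inference angle, the underlying signal can be illustrated with a standard loss-threshold baseline: text the model saw during training tends to score a lower loss. This is only a common baseline for the same question, not the paper's procedure, and the model name and threshold below are illustrative.

```python
# Minimal loss-threshold membership inference baseline (a common technique,
# not the BEAST paper's procedure). Model name and threshold are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_nll(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def likely_training_member(text: str, threshold: float = 3.0) -> bool:
    # A suspiciously low loss hints that the text was in the training data.
    # In practice the threshold is calibrated on texts with known status.
    return sequence_nll(text) < threshold
```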
The notes stress the importance of safety training for AI models, both to blunt the effectiveness of such attacks and to allow more powerful models to be deployed safely in the future. They also mention that BEAST has a lower success rate against LLaMA-2, a model trained with explicit safety efforts, which suggests that such safety measures can be effective.