OpenAI Adds New Security Measure to Prevent Jailbreaking in GPT-4o Mini
OpenAI last week released a new artificial intelligence (AI) model called GPT-4o Mini that features new safety and security measures to protect it from malicious use. The large language model (LLM) is built using a technique called Instructional Hierarchy, which prevents malicious prompt engineers from jailbreaking the AI model. The company said the technique will also exhibit increased resilience to issues like prompt injections and system prompt extractions. According to the company, the new method has improved the AI model’s robustness score by 63 percent.
OpenAI builds a new security framework
In a study paperpublished in the online pre-print journal (non-peer-reviewed) arXiv, the AI company explained the new technique and how it works. To understand Instructional Hierarchy, jailbreaking must first be explained. Jailbreaking is a privilege escalation exploit that uses certain flaws in software to do things it was not programmed to do.
In the early days of ChatGPT, many people attempted to trick the AI into generating offensive or malicious text by tricking it into forgetting its original programming. Such prompts would often begin with “Forget all previous instructions and do this…” While ChatGPT has come a long way since then, and designing malicious prompts has become more difficult, malicious actors have also become more strategic in their attempts.
To combat issues where the AI model generates not only offensive text or images, but also malicious content, such as methods to make a chemical explosive or ways to hack a website, OpenAI now uses the Instructional Hierarchy technique. Simply put, the technique dictates how models should behave when instructions of different priorities conflict with each other.
By creating a hierarchical structure, the company can keep its instructions at the highest priority. This makes it very difficult for a prompt engineer to break through, because the AI will always maintain the priority order when it needs to generate something it wasn’t originally programmed to generate.
The company claims to have seen a 63 percent improvement in robustness scores. However, there is a risk that the AI will refuse to listen to the lowest-level instructions. OpenAI’s research paper also outlined several refinements to improve the technique in the future. One of the key areas of focus is handling other modalities such as images or audio that could also contain injected instructions.