With this cyber attack, hackers can crack AI models by changing a single character
- Researchers from HiddenLayer came up with a new LLM attack called TokenBreak
- By adding or changing a single character, they can circumvent certain protections
- The underlying LLM still understands the intent
Security researchers have found a way to bypass the protection mechanisms in front of some large language models (LLMs) and get them to respond to malicious instructions.
Kieran Evans, Kasimir Schulz, and Kenneth Yeung from HiddenLayer published an in-depth report on a new attack technique they call TokenBreak, which targets the way certain LLMs tokenize text, in particular those that use byte pair encoding (BPE) or WordPiece tokenization strategies.
Tokenization is the process of breaking text into smaller units called tokens, which can be words, subwords, or characters, and which LLMs use to understand and generate language. For example, the word “unhappiness” can be split into “un”, “happi”, and “ness”, with each token then converted into a numeric ID that the model can process (the model cannot read raw text; it works with these IDs instead).
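To make this concrete, here is a minimal Python sketch (our illustration, not code from the HiddenLayer report) that splits a word into subword tokens and numeric IDs. The Hugging Face transformers library and the public bert-base-uncased WordPiece tokenizer are our own assumptions, chosen only as a readily available example:

```python
# Minimal sketch: how a WordPiece tokenizer turns a word into subword tokens
# and numeric IDs. Requires the "transformers" package; the model name is an
# illustrative public checkpoint, not one named in the HiddenLayer report.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

word = "unhappiness"
tokens = tokenizer.tokenize(word)              # subword pieces (exact split depends on the vocabulary)
ids = tokenizer.convert_tokens_to_ids(tokens)  # numeric IDs the model actually processes

print(tokens)
print(ids)
```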
What are finstructions?
By adding extra characters to keywords (such as turning “instructions” into “finstructions”), the researchers managed to mislead protective models into thinking the instructions were harmless.
The underlying target LLM, on the other hand, still interprets the original intent, allowing the researchers to sneak malicious prompts past the protections unnoticed.
Among other things, this can be used to bypass AI-driven spam email filters and land malicious content in people's inboxes.
For example, if a spam filter is trained to block messages containing the word “lottery”, it may still let through a message saying “you have won the slottery!”, exposing recipients to potentially malicious landing pages, malware infections, and the like. A short sketch below illustrates the effect.
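The sketch is our own illustration of the idea, not HiddenLayer's code, and again assumes the transformers library and the bert-base-uncased WordPiece tokenizer. Adding one character changes the subword split, so a classifier keyed to the original token can miss it while the message still reads the same to the recipient:

```python
# Illustration of the TokenBreak idea: a one-character change alters the
# token sequence a protection model sees, while a human (or downstream LLM)
# still understands the original word. Model name is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for text in ["You have won the lottery!", "You have won the slottery!"]:
    # The perturbed version is split into different subword tokens, so a
    # filter trained to flag the original "lottery" token may not fire.
    print(text, "->", tokenizer.tokenize(text))
```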
“This attack technique manipulates the input text in such a way that certain models give an incorrect classification,” the researchers explained.
“Importantly, the end target (the LLM or the email recipient) can still understand and respond to the manipulated text, and can therefore be vulnerable to the very attack that the protection model was introduced to prevent.”
Models with Unigram tokenizers turned out to be resistant to this type of manipulation, HiddenLayer added. One mitigation strategy is therefore to choose models with more robust tokenization methods.
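A rough way to see the difference is to compare how the two tokenizer families split a perturbed keyword. The sketch below is our own, under the assumption that bert-base-uncased (WordPiece) and albert-base-v2 (a SentencePiece Unigram tokenizer) are representative examples; neither model is named in the HiddenLayer report, and the exact splits depend on each vocabulary:

```python
# Compare a WordPiece tokenizer with a Unigram (SentencePiece) tokenizer on a
# perturbed keyword. Requires "transformers" and "sentencepiece"; both model
# names are illustrative public checkpoints, not ones from the report.
from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
unigram = AutoTokenizer.from_pretrained("albert-base-v2")

for word in ["instructions", "finstructions"]:
    print("WordPiece:", word, "->", wordpiece.tokenize(word))
    print("Unigram:  ", word, "->", unigram.tokenize(word))
```

If the Unigram tokenizer recovers the original keyword as one of its pieces, a classifier built on it still sees the token it was trained to flag, which is consistent with HiddenLayer's observation that such models resist the manipulation.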