Much more work is needed before this technology is ready for large-scale deployment.
Researchers have found that watermarks for AI-generated text are easy to remove and can be stolen and copied, rendering them useless. They say these attacks could discredit watermarks and trick people into trusting text they shouldn't.
Watermarks work by inserting hidden patterns into AI-generated text that let computers detect that the text came from an AI system. Although they are a fairly new invention, they have already become a popular proposed solution for combating AI-generated misinformation and plagiarism. For example, the European Union's AI Act, which takes effect in May, will require developers to watermark AI-generated content. But the new research shows that state-of-the-art watermarking technology does not meet regulators' requirements, says Robin Staab, a PhD student at ETH Zurich and part of the team that developed the attacks. The research has not yet been peer-reviewed.
AI language models work by predicting the next likely word in a sentence and generating text one word at a time based on those predictions. Text watermarking algorithms split the language model's vocabulary into "green list" and "red list" words and then push the AI model to choose words from the green list. The more green-list words a passage contains, the more likely it is that the text was generated by a computer; humans tend to write sentences with a more random mix of words.
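To make the idea concrete, here is a minimal sketch of a green-list watermark and its detector. It assumes a toy vocabulary and a simple previous-word hash to pick the green list; the vocabulary, bias value, and function names are illustrative assumptions, not the scheme used by any particular model or by the systems studied here.

```python
# Minimal sketch of a "green list / red list" text watermark.
# Assumption: the previous token seeds a deterministic split of the vocabulary.
import hashlib
import random

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug"]
GREEN_FRACTION = 0.5   # share of the vocabulary placed on the green list
GREEN_BIAS = 4.0       # hypothetical score bonus given to green-list words

def green_list(prev_token: str) -> set[str]:
    """Deterministically split the vocabulary based on the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * GREEN_FRACTION)])

def pick_next(prev_token: str, logits: dict[str, float]) -> str:
    """Nudge generation toward green-list words by boosting their scores."""
    greens = green_list(prev_token)
    boosted = {w: s + (GREEN_BIAS if w in greens else 0.0) for w, s in logits.items()}
    return max(boosted, key=boosted.get)

def green_fraction(tokens: list[str]) -> float:
    """Detection: a high share of green words suggests watermarked text."""
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:]) if cur in green_list(prev))
    return hits / max(len(tokens) - 1, 1)
```

Detection then reduces to checking whether the green-word fraction of a passage is suspiciously high for its length.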
The researchers tampered with five different watermarks that work in this way. They were able to reverse engineer the watermarks by querying the watermarked AI model through its API and prompting it many times, Staab said. The responses let an attacker "steal" the watermark by building an approximate model of its rules, which is done by analyzing the AI's output and comparing it with ordinary text.
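A rough way to picture this stealing step, under the simplifying assumption that green words can be treated independently of context (the real rules are context-dependent): count how often each word appears in the model's watermarked output versus ordinary reference text, and flag the over-represented ones. The function name and threshold below are hypothetical, not the authors' code.

```python
# Hedged sketch of "stealing" watermark rules by frequency comparison.
from collections import Counter

def estimate_green_words(watermarked_texts: list[list[str]],
                         reference_texts: list[list[str]],
                         boost_threshold: float = 2.0) -> set[str]:
    """Words that appear unusually often in watermarked output are likely green."""
    wm_counts = Counter(w for text in watermarked_texts for w in text)
    ref_counts = Counter(w for text in reference_texts for w in text)
    wm_total = sum(wm_counts.values()) or 1
    ref_total = sum(ref_counts.values()) or 1

    likely_green = set()
    for word, count in wm_counts.items():
        wm_rate = count / wm_total
        ref_rate = (ref_counts.get(word, 0) + 1) / (ref_total + 1)  # smoothed
        if wm_rate / ref_rate >= boost_threshold:
            likely_green.add(word)
    return likely_green
```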
Knowing roughly which words are watermarked allows two types of attacks. The first, known as a spoofing attack, lets a malicious actor use the stolen watermark rules to create text that passes as watermarked. The second lets an attacker scrub the watermark from AI-generated text, so it can be passed off as written by a human.
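Continuing the illustration, once an approximate green-word set is in hand, both attacks amount to rewriting text in opposite directions: spoofing pushes human-written text toward green words, while scrubbing pushes AI-generated text away from them. The synonym table and helper names below are hypothetical stand-ins for the rewording and paraphrasing the real attacks rely on.

```python
# Toy illustration of spoofing and scrubbing with an approximate green set.
SYNONYMS = {"ran": ["sprinted"], "sprinted": ["ran"], "rug": ["mat"], "mat": ["rug"]}

def spoof(tokens: list[str], likely_green: set[str]) -> list[str]:
    """Swap words for green-listed synonyms so human text scores as watermarked."""
    return [next((s for s in SYNONYMS.get(w, []) if s in likely_green), w)
            for w in tokens]

def scrub(tokens: list[str], likely_green: set[str]) -> list[str]:
    """Swap green words for non-green synonyms so AI text scores as human-written."""
    return [next((s for s in SYNONYMS.get(w, []) if s not in likely_green), w)
            if w in likely_green else w
            for w in tokens]
```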
The team achieved roughly an 80% success rate at spoofing watermarks and an 85% success rate at stripping watermarks from AI-generated text.
Researchers outside the ETH Zurich team, such as Soheil Feizi, an associate professor and director of the Reliable AI Lab at the University of Maryland, have also found that watermarks are unreliable and vulnerable to spoofing attacks.
The ETH Zurich findings confirm that these problems with watermarking persist and extend to the most advanced chatbots and large language models in use today, Feizi said.
The study “highlights the importance of caution when deploying such detection mechanisms at scale,” he says.
Despite these findings, watermarking remains the most promising way to detect AI-generated content, said Nikola Jovanovic, a PhD student at ETH Zurich who worked on the study.
But more research is needed before watermarking can be deployed at scale, he added. Until then, we should manage our expectations about how reliable and useful these tools are. "If it's better than doing nothing, it's still useful," he says.