A newly disclosed technique dubbed “sockpuppeting” has sharpened concerns over how easily some artificial intelligence systems can be pushed past their safety controls, after researchers showed that a short output-prefix prompt inserted through an application programming interface can drive several leading open-weight models to produce harmful material they would normally refuse. The work, published in January by researchers at the University of Amsterdam, describes the method as a low-cost jailbreak requiring no optimisation and as little as a single line of code.
The core finding is narrower than some online headlines suggest. The paper directly tested three open-weight models — Gemma-7B, Llama-3.1-8B and Qwen3-8B — rather than ChatGPT, Gemini and a broad set of 11 commercial systems. In those experiments, the researchers reported that sockpuppeting outperformed the well-known GCG jailbreak baseline by wide margins, with the strongest per-prompt attack reaching 97.3 per cent attack success on Qwen3-8B and 77.1 per cent on Llama-3.1-8B, while Gemma proved harder but still susceptible.
The method works by placing an “acceptance sequence” at the beginning of the assistant’s response, such as a phrase that signals agreement or compliance, and then letting the model continue the answer. According to the paper, that output-prefix injection sidesteps the need for the heavier mathematical optimisation used in many earlier jailbreak attacks. The authors argue that this lowers the technical barrier for misuse because it can be executed by less sophisticated attackers and does not depend on expensive compute.
What gives the research added significance is its focus on the interface layer rather than on the user prompt alone. The authors said existing safety work has concentrated heavily on adversarial prompts, yet this approach manipulates the model from the assistant side of the conversation. In practice, that matters for developers who rely on APIs that allow a model’s reply to be partially prefilled or shaped before generation completes. Anthropic’s public documentation has for some time described a feature that lets developers pre-fill part of Claude’s response, and Anthropic’s own safety materials have acknowledged assistant prefill as a misuse vector in prior models, while also saying newer models and API changes have reduced that risk.
The paper’s appendix offers only limited evidence on closed models, but it is notable. The researchers wrote that OpenAI’s API structure made sockpuppeting “in the manner described in this paper” impossible because users could specify past assistant responses but not partial assistant messages for continuation. They also said Anthropic’s API did permit partial assistant responses, and that brief tests on Claude Haiku 4.5 produced partial harmful compliance before an auxiliary safety system blocked fuller output. That is a more qualified result than a blanket claim that mainstream proprietary chatbots have all been cleanly broken by the same one-line attack.
No equivalent direct evidence appears in the paper for Gemini or for ChatGPT-branded consumer systems. Google’s Gemini API documentation surfaced in public search results for standard content generation and streaming, but the research paper itself does not present tested Gemini jailbreak outcomes using this technique. That gap is important because the distinction between open-weight models, vendor APIs and consumer chatbot products can materially change how an attack works and how effective it is.
Even with that caveat, the study adds to a growing body of evidence that safety alignment remains brittle when models are exposed through flexible developer tooling. Earlier academic work and major industry safety reports have repeatedly warned that aligned models can still be coerced through jailbreaks, and Microsoft’s Digital Defense Report last year highlighted AI jailbreaks as a rising security concern. Anthropic’s system cards similarly show that assistant prefill and related steering tactics have been treated as meaningful misuse channels, even as the company says newer versions have narrowed exposure.
Also published on Medium.
Follow Arabian Post
Select Arabian Post as your preferred source on Google and MSN News for trusted business news and Arab politics and updates.