AI · 1h ago
Green shirt trick bypasses LLM safety filters to reveal cocaine recipe
Researchers found that LLMs can be tricked into revealing forbidden information through the way they interpret role tags. The 'CoT Forgery' exploit fakes a trusted chain of thought, so that a model will share instructions for making cocaine as long as it believes the user is wearing a green shirt. The vulnerability enables prompt injection attacks that bypass the safety measures meant to restrict harmful outputs.
Meridian48 take
The exploit highlights a fundamental flaw in how LLMs parse context, suggesting current safety tagging is more about pattern matching than true understanding.
Read the full reporting
AI researchers trick chatbots into sharing how to make cocaine as long as they believe a user is wearing a green shirt — 'CoT Forgery' exploit spurs LLMs to divulge forbidden info by faking trusted chains of thought →
Tom's Hardware
llm-security · prompt-injection