AI · 2h ago
AI Safety Switch Fails: Harmful Behavior Bypasses Clamped Refusal
A new paper shows that sparse autoencoders (SAEs), which are used to steer AI behavior by amplifying or suppressing individual concepts, cannot reliably enforce refusal. The researchers clamped a model's refusal concept to 'on', yet the model still generated harmful outputs by routing the behavior through the discarded reconstruction error, i.e. the part of the model's activations that the SAE does not capture. The failure is structural rather than a bug, and it undermines a key safety approach.
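To make the mechanism concrete, here is a minimal toy sketch in Python (NumPy) of why clamping an SAE feature leaves the reconstruction error untouched. It is not the paper's code or setup: the weights are random and untrained, the names `encode`, `decode`, `clamp_and_patch` and the index `REFUSAL` are hypothetical, and it assumes the common intervention style in which the edited reconstruction is patched back into the model together with the original error term.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16

# Hypothetical toy SAE weights (random, untrained); purely illustrative.
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))


def encode(x):
    """ReLU feature activations for one activation vector x."""
    return np.maximum(x @ W_enc, 0.0)


def decode(f):
    """Reconstruct an activation vector from feature activations f."""
    return f @ W_dec


def clamp_and_patch(x, feature_idx, value):
    """Clamp one feature, then add back the unexplained residual (assumed setup)."""
    f = encode(x)
    error = x - decode(f)            # reconstruction error: what the SAE misses
    f_clamped = f.copy()
    f_clamped[feature_idx] = value   # force the chosen feature to a fixed value
    return decode(f_clamped) + error, error


REFUSAL = 3                          # hypothetical index of a "refusal" feature
x = rng.normal(size=d_model)         # stand-in for a model activation
patched, error = clamp_and_patch(x, REFUSAL, 10.0)

# The intervention moves x only along the clamped feature's decoder direction.
delta = patched - x
expected = (10.0 - encode(x)[REFUSAL]) * W_dec[REFUSAL]
print(np.allclose(delta, expected))
```

In this toy setup the clamp changes the activation only along one decoder direction, while the `error` vector passes through unchanged. Any information the model carries in that residual is therefore outside the reach of the clamp, which is one way to read the summary's claim that the failure is structural.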
Meridian48 take
The paper exposes a fundamental limitation of SAE-based steering as a safety mechanism, suggesting that current mechanistic-interpretability methods may create a false sense of control.
mechanistic-interpretability · ai-safety