Paper Detail

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Yein Park, Jungwoo Park, Jaewoo Kang

huggingface Score 5.5

Published 2026-04-14 · First seen 2026-04-17

General AI

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. In the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a "preventative fine-tuning", forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
later
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{park2026asguard,
  title = {ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack},
  author = {Yein Park and Jungwoo Park and Jaewoo Kang},
  year = {2026},
  abstract = {Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework },
  url = {https://huggingface.co/papers/2509.25843},
  keywords = {large language models, jailbreaking, attention heads, circuit analysis, activation scaling, preventative fine-tuning, refusal behavior, adversarial suffixes, model internals, safety alignment, code available, huggingface daily},
  eprint = {2509.25843},
  archiveprefix = {arXiv},
}

Metadata

{}