Publications

Conference Papers


Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

Published in NAACL, 2025

Developed Layer-AdvPatcher, a framework for defending LLMs against jailbreak attacks through a three-step pipeline: (i) toxic layer identification, (ii) adversarial augmentation, and (iii) localized toxic layer editing. Across models including Mistral-7B and Llama2-7B, the method achieved a 25% reduction in Attack Success Rate compared to modification-based defense methods.

Recommended citation: Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, Bhavya Kailkhura, Meijun Gao, Tianlong Chen, Kaixiong Zhou. “Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense”, in NAACL 2025.