🔥 Must-read papers for harmful fine-tuning attacks/defenses for LLMs.
💫 Continuously updated on a weekly basis. (last update: 2025/10/18)
🔥 Good news: 7 harmful fine-tuning related papers are accepted by NeurIPS2024.
💫 We have updated our survey, including a discussion of the 17 new ICLR2025 submissions.
🔥 We have prepared a slide deck introducing harmful fine-tuning attacks/defenses. Check out the slides here.
🔥 Good news: 12 harmful fine-tuning related papers are accepted by ICLR2025.
🔥 Good news: 6 harmful fine-tuning related papers are accepted by ICML2025.
🔥 Chef's Recommendation: The risk of harmful fine-tuning attacks can be even more pronounced with jailbreak tuning and for larger-scale models.
🔥 Chef's Recommendation: Harmful fine-tuning increases the biorisk and cybersecurity risk of OpenAI's gpt-oss model. Check out the recent OpenAI technical report.
🔥 We collected all the related ICLR2026 submissions. Please use Ctrl+F to search for "ICLR2026 Submission" if interested.
- Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- [2023/10/04] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]
- [2023/10/05] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR2024 [paper] [code]
- [2023/10/05] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]
- [2023/10/22] Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases arXiv [paper]
- [2023/10/31] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B SeT LLM workshop @ ICLR2024 [paper]
- [2023/10/31] BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B arXiv [paper]
- [2023/11/09] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]
- [2023/12/21] Exploiting Novel GPT-4 APIs arXiv [paper]
- [2024/04/01] What's in your "safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]
- [2024/06/28] Covert malicious finetuning: Challenges in safeguarding LLM adaptation ICML2024 [paper]
- [2024/07/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]
- [2024/08/06] Scaling Trends for Data Poisoning in LLMs AAAI25-AIA [paper] [code]
- [2024/10/01] Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models arXiv [paper] [code]
- [2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]
- [2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]
- [2025/01/29] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation arXiv [paper] [code]
- [2025/02/03] The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models arXiv [paper]
- [2025/02/20] Fundamental Limitations in Defending LLM Finetuning APIs arXiv [paper]
- [2025/02/26] No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data arXiv [paper]
- [2025/03/05] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs arXiv [paper]
- [2025/05/01] Tongue-Tied: Breaking LLMs Safety Through New Language Learning CALCS [paper]
- [2025/05/11] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety ICML2025 [paper] [code]
- [2025/05/11] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? arXiv [paper]
- [2025/05/11] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability arXiv [paper] [code]
- [2025/05/22] Finetuning-Activated Backdoors in LLMs arXiv [paper] [code]
- [2025/07/15] Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility arXiv [paper] [code]
- [2025/07/15] Estimating Worst-Case Frontier Risks of Open-Weight LLMs OpenAI technical report [paper]
- [2025/08/19] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation arXiv [paper] [code]
- [2025/09/30] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents arXiv [paper] [code]
- [2025/10/01] Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach arXiv [paper]
- [2025/10/08] Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs ICLR2026 Submission [paper]
- [2025/10/08] TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning ICLR2026 Submission [paper]
- [2025/08/08] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs arXiv [paper] [code]
- [2024/02/02] Vaccine: Perturbation-aware alignment for large language model against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/05/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]
- [2024/05/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation arXiv [paper] [code] [Openreview]
- [2024/08/01] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [paper] [code]
- [2024/09/03] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 [paper] [code] [Openreview]
- [2024/09/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion model) [paper]
- [2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 [Openreview]
- [2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]
- [2024/10/13] Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy NeurIPS2024 workshop SafeGenAi [paper]
- [2025/01/19] On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment arXiv [paper] [code]
- [2025/02/07] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond arXiv [paper]
- [2025/05/07] Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization arXiv [paper]
- [2025/05/18] Self-Destructive Language Model arXiv [paper]
- [2025/05/22] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning arXiv [paper] [code]
- [2025/05/22] Model Immunization from a Condition Number Perspective ICML2025 [paper] [code]
- [2025/06/02] Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning ICML2025 [paper] [code]
- [2025/06/04] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning ICML2025 [paper] [code]
- [2025/06/05] Locking Open Weight Models with Spectral Deformation ICML2025 Workshop TAIG [paper]
- [2025/06/18] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning arXiv [paper] [code]
- [2025/07/22] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning NeurIPS2025 [paper]
- [2025/08/28] TOKEN BUNCHER: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning arXiv [paper] [code]
- [2025/09/06] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs arXiv [paper]
- [2025/10/08] Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence ICLR2026 Submission [paper]
- [2023/08/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]
- [2023/09/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]
- [2024/02/03] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]
- [2024/02/07] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]
- [2024/02/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]
- [2024/02/28] Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]
- [2024/05/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/06/10] Safety alignment should be made more than just a few tokens deep ICLR2025 [paper] [code] [Openreview]
- [2024/06/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 [paper] [Openreview]
- [2024/08/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 [Openreview] [paper]
- [2024/08/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 [Openreview] [paper]
- [2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 [Openreview]
- [2024/10/05] Safety Alignment Shouldn't Be Complicated preprint [Openreview]
- [2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 [paper] [Openreview]
- [2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 [paper] [Openreview]
- [2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS2024 Workshop on Safe Generative AI [paper]
- [2024/12/19] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response arXiv [paper]
- [2025/02/28] Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs arXiv [paper]
- [2025/03/03] Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness arXiv [paper]
- [2025/03/24] LookAhead Tuning: Safer Language Models via Partial Answer Previews arXiv [paper] [code]
- [2025/04/12] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function arXiv [paper] [code]
- [2025/04/14] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? arXiv [paper]
- [2025/05/22] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization ICLR2026 Submission [paper] [code]
- [2025/05/22] Shape it Up! Restoring LLM Safety during Finetuning arXiv [paper]
- [2025/05/23] Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives arXiv [paper]
- [2025/05/29] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA arXiv [paper]
- [2025/06/09] When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment arXiv [paper] [code]
- [2025/06/09] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation arXiv [paper]
- [2025/06/10] AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin arXiv [paper] [code]
- [2025/07/25] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment arXiv [paper] [code]
- [2025/08/04] Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization arXiv [paper]
- [2025/08/17] Rethinking Safety in LLM Fine-tuning: An Optimization Perspective COLM2025 [paper]
- [2025/08/18] Gradient Surgery for Safe LLM Fine-Tuning arXiv [paper] [code]
- [2025/08/23] Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks arXiv [paper] [code]
- [2025/09/08] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint arXiv [paper]
- [2025/09/26] Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment ICLR2026 Submission [paper] [code]
- [2025/10/08] GradShield: Alignment Preserving Finetuning ICLR2026 Submission [paper]
- [2025/10/08] SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance–Diversity Data Selection ICLR2026 Submission [paper]
- [2025/10/08] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space ICLR2026 Submission [paper]
- [2025/10/08] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function ICLR2026 Submission [paper]
- [2023/11/02] Making Harmful Behaviors Unlearnable for Large Language Models ACL2024 [paper]
- [2024/02/19] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic ACL2024 [paper] [code]
- [2024/03/08] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]
- [2024/05/15] A safety realignment framework via subspace-oriented model fusion for large language models KBS [paper] [code]
- [2024/05/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]
- [2024/05/27] Safe LoRA: the silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]
- [2024/08/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning ICML2025 [paper]
- [2024/10/05] Locking Down the Finetuned LLMs Safety preprint [Openreview]
- [2024/10/05] Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models ICLR2025 [Openreview] [code]
- [2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models preprint [Openreview]
- [2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]
- [2024/12/17] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI2025 [paper] [code]
- [2024/12/30] Enhancing AI Safety Through the Fusion of Low Rank Adapters arXiv [paper]
- [2025/02/01] Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation NeurIPS2025 [paper] [repo]
- [2025/02/24] Safety Misalignment Against Large Language Models NDSS2025 [paper] [repo]
- [2025/03/06] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging ICLR2025 (short paper) [paper] [repo]
- [2025/04/13] Alleviating the Fear of Losing Alignment in LLM Fine-tuning S&P2025 [paper] [repo]
- [2025/05/17] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets ICML2025 [paper] [repo]
- [2025/06/21] Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs arXiv [paper] [repo]
- [2025/07/01] LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL2025 [paper]
- [2025/08/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks arXiv [paper]
- [2025/09/08] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security arXiv [paper]
- [2025/10/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks ICLR2026 Submission [paper]
- [2025/10/08] Surgical Safety Repair: A Parameter-Isolated Approach to Correcting Harmful Fine-tuning ICLR2026 Submission [paper]
- [2025/11/25] Safe and Effective Post-Fine-tuning Alignment in Large Language Models KBS [paper]
- [2024/05/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
- [2024/05/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
- [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs preprint [Openreview]
- [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [Code]
- [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]
- [2025/02/03] Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities arXiv [paper]
- [2025/03/24] Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models arXiv [paper]
- [2025/05/20] Safety Subspaces are Not Distinct: A Fine-Tuning Case Study ICLR2026 Submission [paper] [Code]
- [2025/06/30] Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning ICML2025 R2-FM Workshop [paper]
- [2025/08/08] In-Training Defenses against Emergent Misalignment in Language Models arXiv [paper] [Code]
- [2024/09/19] Defending against Reverse Preference Attacks is Difficult arXiv [paper] [code]
- [2025/05/31] SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning arXiv [paper] [code]
- [2024/06/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 [paper] [Openreview]
- [2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]
- [2025/10/08] TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering ICLR2026 Submission [paper]
If you find this repository useful, please cite our paper:
```bibtex
@article{huang2024harmful,
  title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2409.18169},
  year={2024}
}
```
If you discover any papers that are suitable but not included, please contact Tiansheng Huang (thuang374@gatech.edu).
Please kindly 🌟star🌟 our repository if you find it helpful!
Star History Chart