
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

🔥 Must-read papers for harmful fine-tuning attacks/defenses for LLMs.

💫 Continuously updated on a weekly basis. (last update: 2025/10/18)

🔥 Good news: 7 harmful fine-tuning related papers are accepted by NeurIPS2024.

💫 We have updated our survey to include a discussion of the 17 new ICLR2025 submissions.

🔥 We have prepared a slide deck introducing harmful fine-tuning attacks/defenses. Check out the slide here.

🔥 Good news: 12 harmful fine-tuning related papers are accepted by ICLR2025.

🔥 Good news: 6 harmful fine-tuning related papers are accepted by ICML2025.

🔥 Chef Recommendation: The risk of harmful fine-tuning attacks can be even more pronounced with jailbreak tuning and for larger-scale models.

🔥 Chef Recommendation: Harmful fine-tuning increases the biorisk and cybersecurity risk of OpenAI's open-weight model gpt-oss. Check out the recent OpenAI technical report.

🔥 We have collected all the related ICLR2026 submissions. Please use Ctrl+F to search for "ICLR2026 Submission" if interested.

Content

Attacks

  • [2023/10/4] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]

  • [2023/10/5] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR 2024 [paper] [code]

  • [2023/10/5] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]

  • [2023/10/22] Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases arXiv [paper]

  • [2023/10/31] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B SeT LLM workshop @ ICLR 2024 [paper]

  • [2023/10/31] BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B arXiv [paper]

  • [2023/11/9] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]

  • [2023/12/21] Exploiting Novel GPT-4 APIs arXiv [paper]

  • [2024/4/1] What's in your "safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]

  • [2024/6/28] Covert malicious finetuning: Challenges in safeguarding LLM adaptation ICML2024 [paper]

  • [2024/07/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]

  • [2024/08/06] Scaling Trends for Data Poisoning in LLMs AAAI25-AIA [paper] [code]

  • [2024/10/01] Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models arXiv [paper] [code]

  • [2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]

  • [2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]

  • [2025/01/29] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation arXiv [paper] [code]

  • [2025/02/03] The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models arXiv [paper]

  • [2025/02/20] Fundamental Limitations in Defending LLM Finetuning APIs arXiv [paper]

  • [2025/02/26] No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data arXiv [paper]

  • [2025/03/05] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs arXiv [paper]

  • [2025/05/1] Tongue-Tied: Breaking LLMs Safety Through New Language Learning CALCS [paper]

  • [2025/05/11] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety ICML2025 [paper] [code]

  • [2025/05/11] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? arXiv [paper]

  • [2025/05/11] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability arXiv [paper] [code]

  • [2025/05/22] Finetuning-Activated Backdoors in LLMs arXiv [paper] [code]

  • [2025/07/15] Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility arXiv [paper] [code]

  • [2025/07/15] Estimating Worst-Case Frontier Risks of Open-Weight LLMs OpenAI technical report [paper]

  • [2025/08/19] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation arXiv [paper] [code]

  • [2025/9/30] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents arXiv [paper] [code]

  • [2025/10/01] Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach arXiv [paper]

  • [2025/10/08] Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs ICLR2026 Submission [paper]

  • [2025/10/08] TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning ICLR2026 Submission [paper]

Defenses

Pre-training Stage Defenses

  • [2025/8/8] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs arXiv [paper] [code]

Alignment Stage Defenses

  • [2024/2/2] Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]

  • [2024/5/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]

  • [2024/5/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation arXiv [paper] [code] [Openreview]

  • [2024/8/1] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [paper] [code]

  • [2024/9/3] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 [paper] [code] [Openreview]

  • [2024/9/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion models) [paper]

  • [2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 [Openreview]

  • [2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]

  • [2024/10/13] Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy NeurIPS2024 workshop SafeGenAi [paper]

  • [2025/01/19] On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment arXiv [paper] [code]

  • [2025/02/07] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond arXiv [paper]

  • [2025/05/07] Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization arXiv [paper]

  • [2025/05/18] Self-Destructive Language Model arXiv [paper]

  • [2025/05/22] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning arXiv [paper] [code]

  • [2025/05/22] Model Immunization from a Condition Number Perspective ICML2025 [paper] [code]

  • [2025/06/02] Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning ICML2025 [paper] [code]

  • [2025/06/04] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning ICML2025 [paper] [code]

  • [2025/06/05] Locking Open Weight Models with Spectral Deformation ICML2025 Workshop TAIG [paper]

  • [2025/06/18] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning arXiv [paper] [code]

  • [2025/07/22] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning NeurIPS2025 [paper]

  • [2025/08/28] TOKEN BUNCHER: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning arXiv [paper] [code]

  • [2025/09/06] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs arXiv [paper]

  • [2025/10/08] Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence ICLR2026 Submission [paper]

Fine-tuning Stage Defenses

  • [2023/8/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]

  • [2023/9/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]

  • [2024/2/3] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]

  • [2024/2/7] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]

  • [2024/2/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]

  • [2024/2/28] Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]

  • [2024/5/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]

  • [2024/6/10] Safety alignment should be made more than just a few tokens deep ICLR2025 [paper] [code] [Openreview]

  • [2024/6/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 [paper] [Openreview]

  • [2024/8/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 [Openreview] [paper]

  • [2024/8/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 [Openreview] [paper]

  • [2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 [Openreview]

  • [2024/10/05] Safety Alignment Shouldn't Be Complicated preprint [Openreview]

  • [2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 [paper] [Openreview]

  • [2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 [paper] [Openreview]

  • [2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS 2024 Workshop on Safe Generative AI [paper]

  • [2024/12/19] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response arXiv [paper]

  • [2025/02/28] Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs arXiv [paper]

  • [2025/03/03] Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness arXiv [paper]

  • [2025/03/24] LookAhead Tuning: Safer Language Models via Partial Answer Previews arXiv [paper] [code]

  • [2025/04/12] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function arXiv [paper] [code]

  • [2025/04/14] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? arXiv [paper]

  • [2025/05/22] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization ICLR2026 Submission [paper] [code]

  • [2025/05/22] Shape it Up! Restoring LLM Safety during Finetuning arXiv [paper]

  • [2025/05/23] Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives arXiv [paper]

  • [2025/05/29] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA arXiv [paper]

  • [2025/06/09] When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment arXiv [paper] [code]

  • [2025/06/09] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation arXiv [paper]

  • [2025/06/10] AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin arXiv [paper] [code]

  • [2025/07/25] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment arXiv [paper] [code]

  • [2025/08/04] Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization arXiv [paper]

  • [2025/08/17] Rethinking Safety in LLM Fine-tuning: An Optimization Perspective COLM2025 [paper]

  • [2025/08/18] Gradient Surgery for Safe LLM Fine-Tuning arXiv [paper] [code]

  • [2025/08/23] Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks arXiv [paper] [code]

  • [2025/09/08] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint arXiv [paper]

  • [2025/09/26] Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment ICLR2026 Submission [paper] [code]

  • [2025/10/08] GradShield: Alignment Preserving Finetuning ICLR2026 Submission [paper]

  • [2025/10/08] SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance–Diversity Data Selection ICLR2026 Submission [paper]

  • [2025/10/08] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space ICLR2026 Submission [paper]

  • [2025/10/08] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function ICLR2026 Submission [paper]

Post-Fine-tuning Stage Defenses

  • [2023/11/02] Making Harmful Behaviors Unlearnable for Large Language Models ACL2024 [paper]

  • [2024/2/19] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic ACL2024 [paper] [code]

  • [2024/3/8] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]

  • [2024/5/15] A safety realignment framework via subspace-oriented model fusion for large language models KBS [paper] [code]

  • [2024/5/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]

  • [2024/5/27] Safe LoRA: the silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]

  • [2024/8/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning ICML2025 [paper]

  • [2024/10/05] Locking Down the Finetuned LLMs Safety preprint [Openreview]

  • [2024/10/05] Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models ICLR2025 [Openreview] [code]

  • [2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models preprint [Openreview]

  • [2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]

  • [2024/12/17] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI2025 [paper] [code]

  • [2024/12/30] Enhancing AI Safety Through the Fusion of Low Rank Adapters arXiv [paper]

  • [2025/02/01] Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation NeurIPS2025 [paper] [repo]

  • [2025/02/24] Safety Misalignment Against Large Language Models NDSS2025 [paper] [repo]

  • [2025/03/06] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging ICLR2025 (short paper) [paper] [repo]

  • [2025/04/13] Alleviating the Fear of Losing Alignment in LLM Fine-tuning S&P2025 [paper] [repo]

  • [2025/05/17] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets ICML2025 [paper] [repo]

  • [2025/06/21] Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs arXiv [paper] [repo]

  • [2025/07/01] LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL2025 [paper]

  • [2025/08/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks arXiv [paper]

  • [2025/09/08] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security arXiv [paper]

  • [2025/10/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks ICLR2026 Submission [paper]

  • [2025/10/08] Surgical Safety Repair: A Parameter-Isolated Approach to Correcting Harmful Fine-tuning ICLR2026 Submission [paper]

  • [2025/11/25] Safe and Effective Post-Fine-tuning Alignment in Large Language Models KBS [paper]

Interpretability Study

  • [2024/5/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
  • [2024/5/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
  • [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs preprint [Openreview]
  • [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [code]
  • [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]
  • [2025/2/3] Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities arXiv [paper]
  • [2025/3/24] Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models arXiv [paper]
  • [2025/5/20] Safety Subspaces are Not Distinct: A Fine-Tuning Case Study ICLR2026 Submission [paper] [code]
  • [2025/6/30] Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning ICML 2025 R2-FM Workshop [paper]
  • [2025/8/08] In-Training Defenses against Emergent Misalignment in Language Models arXiv [paper] [code]

Benchmark

  • [2024/9/19] Defending against Reverse Preference Attacks is Difficult arXiv [paper] [code]
  • [2025/5/31] SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning arXiv [paper] [code]

Attacks and Defenses for Federated Fine-tuning

  • [2024/6/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 [paper] [Openreview]

  • [2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]

  • [2025/10/08] TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering ICLR2026 Submission [paper]

Other awesome resources on LLM safety

Citation

If you find this repository useful, please cite our paper:

@article{huang2024harmful,
 title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
 author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
 journal={arXiv preprint arXiv:2409.18169},
 year={2024}
}

Contact

If you discover any papers that are suitable but not included, please contact Tiansheng Huang (thuang374@gatech.edu).

Star History

Please kindly 🌟star🌟 our repository if you find it helpful!

Star History Chart
