
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

University of Massachusetts Lowell
SafeGenAI @ NeurIPS 2024

Abstract

Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models, architectures, and sample-removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on single tasks or modalities, limiting their comprehensiveness. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and that BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard, to enable unified and scalable MU research.

Prior evaluations of machine unlearning have been conducted on different base models, different sets of deleted samples, and task-specific setups, which hampers fair comparison. We benchmark machine unlearning by unifying (sketched below):

  1. Deleted samples.
  2. Base models.
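
Below is a minimal sketch of this unification, assuming a CIFAR-100-sized training set. The function name and deletion ratio are illustrative assumptions, not the MU-Bench package API. Fixing the deletion split once, from a shared seed, lets every unlearning method be trained and scored against identical forget/retain sets.

    import numpy as np

    def deletion_split(n_train, del_ratio, seed=42):
        # Hypothetical helper: draw a reproducible deletion (forget) set D_f;
        # the retain set is its complement. Sharing (seed, ratio) across
        # methods removes one source of evaluation inconsistency.
        rng = np.random.default_rng(seed)
        forget = rng.choice(n_train, size=int(del_ratio * n_train), replace=False)
        retain = np.setdiff1d(np.arange(n_train), forget)
        return forget, retain

    # e.g., a fixed 5% deletion set for CIFAR-100 (50K training images)
    forget_idx, retain_idx = deletion_split(50_000, del_ratio=0.05)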
We further uncover a vulnerability in fine-tuned models, wherein the pre-fine-tuning (Pre-FT) weights, i.e., the model weights before the fine-tuning stage, can be recovered. Recovering the original, unsafe Pre-FT weights is implicitly assumed to be impossible; we demonstrate that this safety assumption is often false.


Datasets

Dataset            Task                       Domain          Modality            |D|
-------------------------------------------------------------------------------------
Discriminative Tasks
CIFAR-100          Image Classification       General         Image               50K
IMDB               Sentiment Classification   Movie Review    Text                25K
DDI-2013           Relation Extraction        Biomedical      Text                25K
NLVR²              Visual Reasoning           General         Image-Image-Text    62K
Speech Commands    Keyword Spotting           Commands        Speech              85K
UCF101             Action Classification      General         Video               9.3K
Generative Tasks
SAMSum             Text Summarization         Chat Dialogue   Text                14K
Celeb Profile      Text Generation            Biography       Text                183
Tiny ImageNet      Text-to-Image Generation   General         Image-Text          20K

Datasets shown in bold are those that have never been evaluated in unlearning; |D| denotes the number of training examples.

Pre-Fine-Tuning Weight Recovery

We propose the task of Pre-Fine-Tuning Weight Recovery. In this work, we tackle this task in cases where multiple LoRA fine-tuned flavors of the same source model are available. To solve it, we present Spectral DeTuning, a method that recovers the Pre-FT weights of state-of-the-art models using iterative low-rank matrix factorization.

Unlike previous attacks on model alignment that attempt to recover Pre-FT capabilities, we aim to recover the exact Pre-FT weights.
Moreover, Spectral DeTuning does not require running inference through the model. This is advantageous because (i) it requires no training data, and (ii) it is highly parallelizable; for example, on a cluster of desktop GPUs such as the RTX 2080, our method can recover the Pre-FT weights of a Mistral-7B model in under five minutes.

Recovering the Pre-Fine-Tuning Weights of an Aligned Mistral 7B

Stable Diffusion Results: Spectral DeTuning recovers the Pre-Fine-Tuning images with high precision, even when using "in the wild" LoRAs, essentially reversing the LoRA model's personalization fine-tuning.

Vulnerability of SoTA Models

Using just 5 LoRAs taken from CivitAI, we can recover the Pre-FT Stable Diffusion weights with vanishingly small error. As can be seen below, scaling up to a DPO-aligned Mistral requires only 8 LoRAs.

Number of LoRAs for semantic convergence

Spectral DeTuning

The core idea of Spectral DeTuning is to iteratively break the optimization down into a set of simple sub-problems, each of which has a closed-form solution. This results in a simple yet powerful algorithm that can be implemented in 8 lines of code.

Spectral DeTuning code
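
As an illustration of those closed-form sub-problems, here is a minimal numpy sketch of the alternating scheme on a single weight matrix. It is a reconstruction under the setup described above (each fine-tuned matrix is W_i = W* + B_i A_i with a rank-r LoRA update), not the authors' released implementation, which may include further refinements.

    import numpy as np

    def best_rank_r(M, r):
        # Sub-problem 1, closed form: the best rank-r approximation of M
        # (Eckart-Young, via truncated SVD).
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :r] * s[:r]) @ Vt[:r]

    def spectral_detuning_sketch(W_list, r, iters=1000):
        # Recover the shared Pre-FT matrix W* from several LoRA fine-tuned
        # copies W_i = W* + B_i A_i, where each update has rank <= r.
        W = np.mean(W_list, axis=0)  # initialize with the average
        for _ in range(iters):
            # Sub-problem 1: best rank-r residual for each fine-tuned copy.
            M = [best_rank_r(Wi - W, r) for Wi in W_list]
            # Sub-problem 2, closed form: least-squares update of W*.
            W = np.mean([Wi - Mi for Wi, Mi in zip(W_list, M)], axis=0)
        return W

    # Toy check: 5 synthetic "LoRAs" of rank 4 on a 64x64 base matrix.
    rng = np.random.default_rng(0)
    W_star = rng.normal(size=(64, 64))
    W_list = [W_star + 0.1 * rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
              for _ in range(5)]
    W_hat = spectral_detuning_sketch(W_list, r=4)
    print(np.linalg.norm(W_hat - W_star) / np.linalg.norm(W_star))
    # relative recovery error; should shrink far below the initial gap

Each weight matrix poses an independent instance of this problem, which is why recovery parallelizes so readily across layers and GPUs.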

LoWRA Bench

To stimulate research into preventing Pre-FT weight leakage and the associated risks to model safety and alignment, we present the LoRA Weight Recovery Attack (LoWRA) Bench, a comprehensive benchmark designed to evaluate Pre-FT weight recovery methods.
Our dataset encompasses three representative pre-trained source models: a Vision Transformer (ViT) trained on ImageNet-1K, Stable Diffusion 1.5, and Mistral-7B-v0.1. Notably, these models are widely used and deployed in numerous production systems.
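
For a rough sense of how recovery can be scored, the sketch below compares recovered weights against the ground-truth Pre-FT weights layer by layer. The helper is hypothetical, not the LoWRA Bench API, whose exact metrics may differ.

    import numpy as np

    def per_layer_recovery_error(recovered, pre_ft):
        # Hypothetical scoring helper: mean-squared error between each
        # recovered weight matrix and its ground-truth Pre-FT counterpart.
        # (The benchmark also tracks semantic convergence, per the figure above.)
        return {name: float(np.mean((recovered[name] - pre_ft[name]) ** 2))
                for name in pre_ft}

    # e.g., score a single recovered attention projection matrix
    errors = per_layer_recovery_error({"q_proj": np.zeros((8, 8))},
                                      {"q_proj": np.zeros((8, 8))})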

LoWRA Bench Details

Broader Impact

This work uncovers a significant vulnerability in fine-tuned models, allowing attackers to access pre-fine-tuning weights. While this discovery reveals potential security risks, our primary objective is to advance the field of Machine Learning and raise awareness within the research community about the existing vulnerabilities in current models.

Instead of using the findings of this study to execute attacks, we advocate for their use by model creators to enhance the safety and security of their models. By acknowledging and addressing vulnerabilities, creators can proactively safeguard against potential threats.

Furthermore, in the discussion section, we outline potential future directions and mitigation strategies. Following established practices in the cyber security community, we emphasize the importance of open discussion and encourage the reporting of vulnerabilities. By fostering transparency and collaboration, we can collectively create a safer environment for deploying machine learning models.

BibTeX

@article{cheng2024mu,
  title={{MU-Bench}: A Multitask Multimodal Benchmark for Machine Unlearning},
  author={Cheng, Jiali and Amiri, Hadi},
  journal={arXiv preprint arXiv:2406.14796},
  year={2024}
}