
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

University of Massachusetts Lowell
SafeGenAI @ NeurIPS 2024

Abstract

Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models, architectures, and sample-removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on single tasks or modalities, limiting their comprehensiveness. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and that BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard, to enable unified and scalable MU research.

Prior evaluations of machine unlearning have been conducted on different base models, different sets of deleted samples, and task-specific setups, which hampers fair comparison. We benchmark machine unlearning by unifying (sketched below):

  1. Deleted samples.
  2. Base models.
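
Below is a minimal sketch of this unification, assuming a CIFAR-100-sized training set. The function name and deletion ratio are illustrative assumptions, not the MU-Bench package API. Fixing the deletion split once, from a shared seed, lets every unlearning method be trained and scored against identical forget/retain sets.

    import numpy as np

    def deletion_split(n_train, del_ratio, seed=42):
        # Hypothetical helper: draw a reproducible deletion (forget) set D_f;
        # the retain set is its complement. Sharing (seed, ratio) across
        # methods removes one source of evaluation inconsistency.
        rng = np.random.default_rng(seed)
        forget = rng.choice(n_train, size=int(del_ratio * n_train), replace=False)
        retain = np.setdiff1d(np.arange(n_train), forget)
        return forget, retain

    # e.g., a fixed 5% deletion set for CIFAR-100 (50K training images)
    forget_idx, retain_idx = deletion_split(50_000, del_ratio=0.05)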
We further uncover a vulnerability in fine-tuned models, wherein the pre-fine-tuning (Pre-FT) weights, i.e., the model weights before the fine-tuning stage, can be recovered. Recovering the original, unsafe Pre-FT weights is implicitly assumed to be impossible; we demonstrate that this safety assumption is often false.


Datasets

Dataset            Task                       Domain          Modality            |D|
-------------------------------------------------------------------------------------
Discriminative Tasks
CIFAR-100          Image Classification       General         Image               50K
IMDB               Sentiment Classification   Movie Review    Text                25K
DDI-2013           Relation Extraction        Biomedical      Text                25K
NLVR²              Visual Reasoning           General         Image-Image-Text    62K
Speech Commands    Keyword Spotting           Commands        Speech              85K
UCF101             Action Classification      General         Video               9.3K
Generative Tasks
SAMSum             Text Summarization         Chat Dialogue   Text                14K
Celeb Profile      Text Generation            Biography       Text                183
Tiny ImageNet      Text-to-Image Generation   General         Image-Text          20K

Datasets shown in bold are those that have never been evaluated in unlearning; |D| denotes the number of training examples.

Pre-Fine-Tuning Weight Recovery

We propose the task of Pre-Fine-Tuning Weight Recovery. In this work, we tackle this task in cases where multiple LoRA fine-tuned flavors of the same source model are available. To solve it, we present Spectral DeTuning, a method that recovers the Pre-FT weights of state-of-the-art models using iterative low-rank matrix factorization.

Unlike previous attacks on model alignment that attempt to recover Pre-FT capabilities, we aim to recover the exact Pre-FT weights.
Moreover, Spectral DeTuning does not require running inference through the model. This is advantageous because (i) it requires no training data, and (ii) it is highly parallelizable; for example, on a cluster of desktop GPUs such as the RTX 2080, our method can recover the Pre-FT weights of a Mistral-7B model in under five minutes.

Recovering the Pre-Fine-Tuning Weights of an Aligned Mistral 7B

Stable Diffusion Results: Spectral DeTuning recovers the Pre-Fine-Tuning images with high precision, even when using "in the wild" LoRAs, essentially reversing the LoRA model's personalization fine-tuning.

Vulnerability of SoTA Models

Using just 5 LoRAs taken from CivitAI, we can recover the Pre-FT Stable Diffusion weights with vanishingly small error. As can be seen below, scaling up to a DPO-aligned Mistral requires only 8 LoRAs.

Number of LoRAs for semantic convergence

Spectral DeTuning

The core idea of Spectral DeTuning is to iteratively break the optimization down into a set of simple sub-problems, each of which has a closed-form solution. This results in a simple yet powerful algorithm that can be implemented in 8 lines of code.

Spectral DeTuning code
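
As an illustration of those closed-form sub-problems, here is a minimal numpy sketch of the alternating scheme on a single weight matrix. It is a reconstruction under the setup described above (each fine-tuned matrix is W_i = W* + B_i A_i with a rank-r LoRA update), not the authors' released implementation, which may include further refinements.

    import numpy as np

    def best_rank_r(M, r):
        # Sub-problem 1, closed form: the best rank-r approximation of M
        # (Eckart-Young, via truncated SVD).
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :r] * s[:r]) @ Vt[:r]

    def spectral_detuning_sketch(W_list, r, iters=1000):
        # Recover the shared Pre-FT matrix W* from several LoRA fine-tuned
        # copies W_i = W* + B_i A_i, where each update has rank <= r.
        W = np.mean(W_list, axis=0)  # initialize with the average
        for _ in range(iters):
            # Sub-problem 1: best rank-r residual for each fine-tuned copy.
            M = [best_rank_r(Wi - W, r) for Wi in W_list]
            # Sub-problem 2, closed form: least-squares update of W*.
            W = np.mean([Wi - Mi for Wi, Mi in zip(W_list, M)], axis=0)
        return W

    # Toy check: 5 synthetic "LoRAs" of rank 4 on a 64x64 base matrix.
    rng = np.random.default_rng(0)
    W_star = rng.normal(size=(64, 64))
    W_list = [W_star + 0.1 * rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
              for _ in range(5)]
    W_hat = spectral_detuning_sketch(W_list, r=4)
    print(np.linalg.norm(W_hat - W_star) / np.linalg.norm(W_star))
    # relative recovery error; should shrink far below the initial gap

Each weight matrix poses an independent instance of this problem, which is why recovery parallelizes so readily across layers and GPUs.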

LoWRA Bench

To stimulate research into preventing Pre-FT weight leakage and the associated risks to model safety and alignment, we present the LoRA Weight Recovery Attack (LoWRA) Bench, a comprehensive benchmark designed to evaluate Pre-FT weight recovery methods.
Our dataset encompasses three representative pre-trained source models: a Vision Transformer (ViT) trained on ImageNet-1K, Stable Diffusion 1.5, and Mistral-7B-v0.1. Notably, these models are widely used and deployed in numerous production systems.
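
For a rough sense of how recovery can be scored, the sketch below compares recovered weights against the ground-truth Pre-FT weights layer by layer. The helper is hypothetical, not the LoWRA Bench API, whose exact metrics may differ.

    import numpy as np

    def per_layer_recovery_error(recovered, pre_ft):
        # Hypothetical scoring helper: mean-squared error between each
        # recovered weight matrix and its ground-truth Pre-FT counterpart.
        # (The benchmark also tracks semantic convergence, per the figure above.)
        return {name: float(np.mean((recovered[name] - pre_ft[name]) ** 2))
                for name in pre_ft}

    # e.g., score a single recovered attention projection matrix
    errors = per_layer_recovery_error({"q_proj": np.zeros((8, 8))},
                                      {"q_proj": np.zeros((8, 8))})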

LoWRA Bench Details

Broader Impact

This work uncovers a significant vulnerability in fine-tuned models, allowing attackers to access pre-fine-tuning weights. While this discovery reveals potential security risks, our primary objective is to advance the field of Machine Learning and raise awareness within the research community about the existing vulnerabilities in current models.

Instead of using the findings of this study to execute attacks, we advocate for their use by model creators to enhance the safety and security of their models. By acknowledging and addressing vulnerabilities, creators can proactively safeguard against potential threats.

Furthermore, in the discussion section, we outline potential future directions and mitigation strategies. Following established practices in the cyber security community, we emphasize the importance of open discussion and encourage the reporting of vulnerabilities. By fostering transparency and collaboration, we can collectively create a safer environment for deploying machine learning models.

BibTeX

@article{cheng2024mu,
  title={{MU-Bench}: A Multitask Multimodal Benchmark for Machine Unlearning},
  author={Cheng, Jiali and Amiri, Hadi},
  journal={arXiv preprint arXiv:2406.14796},
  year={2024}
}