Omar Hosney
PEFT (Parameter-Efficient Fine-Tuning) Cheat Sheet
1. Introduction to PEFT
- PEFT reduces computational and storage costs by fine-tuning fewer parameters.
- Enables the training of large models on consumer hardware, making AI more accessible.
- Maintains performance comparable to fully fine-tuned models.
- Seamless Integration: Works with Hugging Face libraries like Transformers, Diffusers, and Accelerate.
2. PEFT Methodologies
- Soft Prompting: Adds learnable parameters to input embeddings to optimize tasks while keeping model parameters frozen.
- LoRA (Low-Rank Adaptation): Uses low-rank matrices to reduce memory usage and computational cost by limiting the number of trainable parameters.
- IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations): Multiplies inner activations (keys, values, and feed-forward activations) by learned vectors, adding very few trainable parameters.
3. Adapter Methods
- Adapters: Small neural networks inserted into layers of a pretrained model, allowing task adaptation without altering the base model.
- X-LoRA: Combines multiple task-specific LoRA adapters through a learned gating mechanism (a mixture of LoRA experts), enhancing flexibility and efficiency.
4. Quick Tour of PEFT
- Install PEFT: Run pip install peft, or install from the GitHub repository for the latest features.
- Configuration: Define method-specific settings, such as the rank of the LoRA matrices, using LoraConfig or PromptEncoderConfig.
- Save Model: Use save_pretrained() to save only the adapter weights, keeping checkpoints small.
- Load Model for Inference: Use from_pretrained() to load a trained adapter on top of the base model, as shown in the sketch below.
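A minimal sketch of this flow using the PEFT API named above; the facebook/opt-350m checkpoint and the hyperparameter values are only illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, TaskType, get_peft_model

# Wrap a base model with a LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # assumed example checkpoint
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05, task_type=TaskType.CAUSAL_LM)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA weights are trainable

# ... train with Trainer or a custom loop ...

model.save_pretrained("opt-350m-lora")  # stores only the small adapter weights

# Load for inference: base model plus the saved adapter.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model = PeftModel.from_pretrained(base, "opt-350m-lora")
```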
5. Advanced Applications
- Integration with Diffusers: Manage multiple adapters for generative AI tasks, such as creating images and videos from text prompts.
- Integration with Transformers: Efficiently train large-scale language models for various NLP tasks using adapters.
- Soft Prompting Methods: Learn task-specific prompts dynamically by adding learnable parameters to input embeddings.
6. Advanced Configurations
- Create Custom Configurations: Tailor PEFT methods to specific needs by creating configurations like LoraConfig.
- API References: Explore detailed API references for methods and classes to fine-tune models effectively.
7. Model Merging & Quantization
- TIES & DARE: Efficiently merge models by eliminating redundant parameters using trimming and rescaling techniques.
- Quantization: Use fewer bits to represent weights, reducing memory usage and accelerating inference for large language models.
- QLoRA: Combines 4-bit quantization with LoRA to fine-tune large models on limited hardware; see the sketch after this list.
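A hedged QLoRA-style sketch: the frozen base model is loaded in 4-bit via bitsandbytes and a LoRA adapter is trained on top. The checkpoint name, target module names, and hyperparameters are placeholders, not a prescription.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the frozen base weights small in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training (fp32 norms, input grads)

# The LoRA adapter is trained in higher precision on top of the quantized base.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```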
Different Adapters
Low-Rank Adaptation (LoRA)
- LoRA represents weight updates using low-rank matrices.
- Keeps pretrained weights frozen, reducing trainable parameters.
- Combines original and adapted weights for final results.
- Efficient and comparable to full fine-tuning.
- Typically applied to attention blocks in Transformer models, as in the sketch below.
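A brief sketch of targeting the attention projections with LoRA. The checkpoint and the q_proj/v_proj module names are assumptions; module names vary by architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # assumed example model
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections (names depend on the model)
    lora_dropout=0.05,
)
model = get_peft_model(model, config)

# After training, the low-rank update can be folded back into the base weights.
merged = model.merge_and_unload()
```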
Mixture of LoRA Experts (X-LoRA)
- X-LoRA uses dense/sparse gating to activate experts dynamically.
- Only the gating layers are trained, keeping the parameter count low.
- Allows the model to reconfigure dynamically during inference.
- Requires a dual forward pass for effective knowledge mixing.
Low-Rank Hadamard Product (LoHa)
- LoHa enhances model expressivity using the Hadamard (element-wise) product.
- Expresses the update with four smaller matrices, achieving a higher effective rank for the same parameter count.
- Originally developed for computer vision and later adapted for diffusion models; see the sketch below.
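A small sketch of applying LoHa through PEFT's LoHaConfig. The ViT backbone, the "query"/"value" module names, and the dropout value are assumptions for illustration.

```python
from transformers import AutoModelForImageClassification
from peft import LoHaConfig, get_peft_model

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10  # assumed example backbone
)
config = LoHaConfig(
    r=8,
    alpha=16,
    target_modules=["query", "value"],  # attention projections in ViT
    module_dropout=0.1,
    modules_to_save=["classifier"],     # train the new classification head normally
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```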
Low-Rank Kronecker Product (LoKr)
- LoKr uses Kronecker product for parameter-efficient finetuning.
- Maintains the original weight matrix's rank.
- Can be vectorized for faster processing.
Orthogonal Finetuning (OFT)
- OFT preserves pretrained model's generative performance.
- Maintains cosine similarity between neurons for semantic preservation.
- Utilizes a sparse block-diagonal matrix to be parameter-efficient.
Orthogonal Butterfly (BOFT)
- BOFT focuses on maintaining pretrained model's structure.
- Uses an orthogonal matrix for transformations.
- Ensures minimal change in the model's latent space.
Adaptive Low-Rank Adaptation (AdaLoRA)
- AdaLoRA allocates parameters based on task importance.
- Uses SVD-like techniques to control rank dynamically.
- Prunes less important parameters for efficiency; see the sketch below.
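A hedged AdaLoraConfig sketch: ranks start at init_r and are pruned toward the target_r budget over the schedule given by tinit/tfinal/deltaT. The checkpoint, module names, and all numeric values are placeholder assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import AdaLoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # assumed example model
config = AdaLoraConfig(
    init_r=12,        # starting rank for every adapted matrix
    target_r=4,       # average rank budget after pruning
    tinit=200,        # warmup steps before pruning starts
    tfinal=1000,      # steps over which the budget is annealed
    deltaT=10,        # reallocate the rank budget every deltaT steps
    total_step=2000,  # assumed total number of training steps
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```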
Llama-Adapter
- Llama-Adapter adapts models for instruction-following.
- Uses learnable prompts to guide higher-level semantics.
- Zero-initialized attention prevents the learned prompts from overwhelming pretrained knowledge; see the sketch below.
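Llama-Adapter is exposed in PEFT as adaption prompts. A minimal sketch, assuming a Llama checkpoint and placeholder values for the prompt length and number of adapted layers.

```python
from transformers import AutoModelForCausalLM
from peft import AdaptionPromptConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed example checkpoint
config = AdaptionPromptConfig(
    adapter_len=10,     # number of learnable prompt tokens per adapted layer
    adapter_layers=30,  # insert prompts into the top 30 transformer layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```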
Soft Prompts
Prompt Tuning
- Trains only a small set of task-specific prompt parameters.
- Originally developed for T5, casting text classification as a text generation task.
- Each prompt token has its own parameters, updated independently.
- Keeps the pretrained model frozen and updates only the prompt embeddings.
- Performance is comparable to full model training; see the sketch below.
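A minimal prompt-tuning sketch with PromptTuningConfig. The t5-base checkpoint, the prompt length, and the initialization text are assumptions for illustration.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")  # assumed example model
config = PromptTuningConfig(
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=8,                      # length of the learned soft prompt
    prompt_tuning_init=PromptTuningInit.TEXT,  # initialize from a natural-language prompt
    prompt_tuning_init_text="Classify whether the review is positive or negative:",
    tokenizer_name_or_path="t5-base",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable
```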
Prefix Tuning
- Optimizes prefix parameters for each task.
- Works with natural language generation tasks on GPT models.
- Prefix parameters are inserted at all layers of the model.
- Uses a separate feed-forward network (FFN) for optimization.
- Comparable to full finetuning with about 1000x fewer parameters; see the sketch below.
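A short prefix-tuning sketch with PrefixTuningConfig; the checkpoint and prefix length are placeholder assumptions.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")  # assumed example model
config = PrefixTuningConfig(
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=20,  # prefix length prepended at every layer
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```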
P-Tuning
- Suitable for natural language understanding tasks.
- Uses a prompt encoder (LSTM) to optimize prompts.
- Prompt tokens can be inserted anywhere in the input sequence.
- Only adds tokens to the input, not to every layer.
- Improves performance with anchor tokens; see the sketch below.
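P-tuning is configured through PromptEncoderConfig. A minimal sketch, assuming a RoBERTa classifier and placeholder sizes for the virtual tokens and the prompt encoder.

```python
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")  # assumed example model
config = PromptEncoderConfig(
    task_type="SEQ_CLS",
    num_virtual_tokens=20,
    encoder_hidden_size=128,  # hidden size of the prompt encoder that reparameterizes the prompts
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```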
Multitask Prompt Tuning
- Enables parameter-efficient transfer learning.
- Learns a single prompt for multiple tasks.
- Consists of source training and target adaptation stages.
- Uses Hadamard product for generating task-specific prompts.
- Trains a shared prompt matrix across all tasks.
IA3 and BOFT
IA3 Overview
- IA3 makes fine-tuning more efficient by using learned vectors to rescale inner activations.
- Only trainable parameters are the learned vectors; original weights remain frozen.
- IA3 drastically reduces the number of trainable parameters to about 0.01% for T0.
- Performance is comparable to fully fine-tuned models without adding inference latency.
IA3 in Practice
- Injected in the attention and feedforward modules of transformers.
- Targets outputs of key and value layers and input of the second feedforward layer.
- Implemented using IA3Config to control how IA3 is applied.
- Example: sequence classification with a Llama model using an IA3Config, as in the sketch below.
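A hedged sketch of that configuration: key/value projections and the second (down) feed-forward projection are rescaled. The Llama checkpoint and num_labels are assumptions.

```python
from transformers import AutoModelForSequenceClassification
from peft import IA3Config, TaskType, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", num_labels=2  # assumed example checkpoint
)
peft_config = IA3Config(
    task_type=TaskType.SEQ_CLS,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # modules treated as feed-forward (rescaled on the input side)
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```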
OFT and BOFT Overview
- OFT uses an orthogonal matrix to transform pretrained weights.
- BOFT generalizes OFT using Butterfly factorization for greater efficiency.
- Uses multiplicative updates for weight matrices, preserving pretraining knowledge better.
- Efficiently reduces the number of trainable parameters while maintaining model performance.
BOFT Key Features
- Uses Butterfly factorization to parameterize the orthogonal matrix.
- Structural constraint maintains hyperspherical energy to prevent knowledge forgetting.
- Supports flexible and parameter-efficient finetuning for various downstream tasks.
- Can merge weights with base model using merge_and_unload().
BOFT Parameters
- boft_block_size: Determines sparsity of update matrices.
- boft_block_num: Specifies number of blocks across layers.
- boft_n_butterfly_factor: Defines the number of butterfly factors.
- boft_dropout: Probability of multiplicative dropout.
Example Usage
- Configure for image classification using BOFTConfig.
- Set parameters like boft_block_size and target_modules.
- Integrate with the Transformers library and PEFT for training; a sketch follows.
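A hedged image-classification sketch with BOFTConfig. The ViT backbone, the "query"/"value" module names, and the block/butterfly values are assumptions chosen for illustration.

```python
from transformers import AutoModelForImageClassification
from peft import BOFTConfig, get_peft_model

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10  # assumed example backbone
)
config = BOFTConfig(
    boft_block_size=4,                  # block size of the sparse block-diagonal factors
    boft_n_butterfly_factor=2,          # number of butterfly factors
    boft_dropout=0.1,                   # multiplicative dropout probability
    target_modules=["query", "value"],  # attention projections in ViT (names vary by model)
    modules_to_save=["classifier"],     # also train the new classification head
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

# After training, the orthogonal updates can be merged into the base weights:
# model = model.merge_and_unload()
```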