Study notes on parameter-efficient finetuning techniques
Finetuning is a way to adapt pretrained language models (LMs) to a specific task or domain. It typically involves attaching a task head to the model and updating the weights of the entire network. However, this process can strain one's compute budget, and the strain only grows as language models get larger with every release.
In this blog post, I want to share my notes on parameter-efficient finetuning (PEFT) techniques. With PEFT, we finetune only a small number of parameters while keeping most of the LM's parameters frozen. As a result, PEFT allows domain adaptation at a lower compute (and storage) cost. Lastly, this blog post is not a literature review; I will only discuss methods I personally like. For each technique, I will give an overview, mention related works, and sketch a high-level implementation.
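To make the contrast with full finetuning concrete, here is a minimal sketch (assuming PyTorch; the helper names are my own) of the freezing step that PEFT methods share. The pretrained backbone stops receiving gradients, and only whatever small set of parameters a given method adds or selects stays trainable:

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> None:
    # Freeze every pretrained parameter; only newly added
    # task-specific modules (left trainable) will be updated.
    for param in model.parameters():
        param.requires_grad = False

def count_trainable(model: nn.Module) -> int:
    # How many parameters the optimizer will actually touch.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

In a full finetune, every parameter receives gradients; after freezing the backbone, the trainable count for a typical PEFT setup is often just a small fraction of the total.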
Finetuning is the de facto transfer learning technique, but it has become inefficient
To recap, pretrained language models like BERT (Devlin et al., 2019) produce contextualized word representations that capture the meaning of each token within its surrounding text. These representations are already useful by themselves. However, language models owe much of their versatility and state-of-the-art performance to finetuning (Howard and Ruder, 2018).
Most of the pretrained LMs we use today are based on transformer networks (Vaswani et al., 2017). Let's review their architecture, as it will help us understand the PEFT techniques later on. Recall that most transformer networks consist of a stack of encoder and decoder layers built around an attention mechanism:
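At the heart of each layer, an attention head computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (Vaswani et al., 2017). Below is a minimal PyTorch sketch of that computation; the function name and tensor layout are my own choices, not a library API:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # scores: query-key similarities, scaled by sqrt(head_dim)
    # to keep the softmax in a well-behaved range.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention distribution
    return weights @ v  # weighted sum of the values
```

The queries, keys, and values come from learned linear projections of the layer's input; these projection matrices make up a large share of a transformer's weights, which is why several PEFT techniques attach their small sets of trainable parameters there.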