[Notes] ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

Jul 3, 2021

image source: https://medium.com/syncedreview/googles-albert-is-a-leaner-bert-achieves-sota-on-3-nlp-benchmarks-f64466dd583

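After going through BERT's architecture, you will see how powerful it really is. It still has drawbacks, though, and one of them is that its parameter count is large in absolute terms: even BERT-Base has 110M parameters. The model introduced below targets exactly this point and offers a lightweight BERT, called A Lite BERT, or ALBERT.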

Paper Link

https://arxiv.org/pdf/1909.11942.pdf

Abstract and Introduction

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.

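In plain terms: making the model bigger during pretraining usually improves downstream performance, but at some point GPU/TPU memory limits and longer training times make further scaling hard. To address this, ALBERT uses two parameter-reduction techniques that lower memory consumption and speed up training, and the resulting models scale better than the original BERT. The authors also use a self-supervised loss, SOP, that models inter-sentence coherence, and show that it helps downstream tasks with multi-sentence inputs. As a result, the best ALBERT model achieves better results on GLUE, RACE, and SQuAD while having fewer parameters than BERT-Large.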

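Simply put, the authors combine two parameter-reduction methods with a new task, SOP, which replaces NSP, to improve model performance. This post covers the three parts in detail.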

Methodology

Method #1. Factorized Embedding Parameterization / Factorizing the Parameters of the Input Embedding Layer

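Looking at BERT's original input pipeline, we first map each token of the input sequence to its ID in the vocabulary, then look up the corresponding word vector in BERT's Token Embedding table. The dimension of each word vector, the Embedding Size (E), is the same as the Hidden Dimension (H): both are 768. In short, BERT ties E and H tightly together, so every token added to the vocabulary costs another H parameters, and the embedding table grows quickly as the vocabulary grows.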

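ALBERT instead first reduces the Embedding Size to 128, then applies a Linear Transform to project the looked-up Embedding to 768 dimensions before feeding it to the Encoder. A quick calculation shows the difference: the original BERT has 30000*768 ≈ 23M parameters in the Embedding, whereas ALBERT has 30000*128 + 128*768 ≈ 4M, a large gap. This makes concrete the paper's statement that the embedding parameters drop from O(V × H) to O(V × E + E × H).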

Key passage from the paper: Instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower dimensional embedding space of size E, and then project it to the hidden space. By using this decomposition, we reduce the embedding parameters from O(V × H) to O(V × E + E × H).
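The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `embedding_params` is made up for illustration, the vocabulary size of 30,000 is the rounded figure used in this post, and the bias of the projection layer is ignored.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters (hypothetical helper).

    embedding_size=None: one V x H table, as in BERT.
    Otherwise: a V x E table plus an E x H projection, as in ALBERT.
    The projection bias is ignored to match the post's arithmetic.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # values used in the text

bert_style = embedding_params(V, H)
albert_style = embedding_params(V, H, E)

print(bert_style)    # 23040000, about 23M
print(albert_style)  # 3938304, about 4M
print(round(bert_style / albert_style, 2))  # 5.85
```

The saving only appears when E is much smaller than H; with E = H the factorized form would add an extra E × H matrix on top of the original table.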

The table below shows how performance changes with different Embedding Size settings.

The effect of vocabulary embedding size on the performance of ALBERT-base.

Method #2. Cross-layer Parameter Sharing / Sharing the Parameters of the Encoder Layers

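In the original BERT, taking BERT-Base as an example, each of the 12 Encoder Blocks has its own independent parameters, so every extra layer adds another full block's worth of parameters. To address this, ALBERT shares the parameters across all Encoder Blocks; the authors report that this greatly reduces the parameter count and stabilizes training. Put bluntly, it stacks the same identical Encoder Block 12 times.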
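To get a feel for the scale, here is a rough parameter count for a stack of Encoder Blocks with and without sharing. It is a sketch under assumptions that are not stated in this post: a standard Transformer block with four H × H attention projections, a feed-forward size of 4H, two LayerNorms, and biases everywhere. The helper names are hypothetical.

```python
def encoder_layer_params(hidden_size, ffn_size=None):
    """Parameters of one standard Transformer encoder block (assumed layout)."""
    if ffn_size is None:
        ffn_size = 4 * hidden_size  # common choice, e.g. 3072 for H = 768
    # Query, key, value and output projections, each H x H plus a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two feed-forward matrices with their biases.
    ffn = (hidden_size * ffn_size + ffn_size) + (ffn_size * hidden_size + hidden_size)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden_size
    return attention + ffn + layer_norms


def encoder_stack_params(num_layers, hidden_size, share_all=False):
    """Stored parameters for the whole stack, with or without sharing."""
    per_layer = encoder_layer_params(hidden_size)
    # With full sharing only one block is stored, however deep the stack is.
    return per_layer if share_all else num_layers * per_layer


print(encoder_layer_params(768))                      # 7087872
print(encoder_stack_params(12, 768))                  # 85054464
print(encoder_stack_params(12, 768, share_all=True))  # 7087872
```

Sharing shrinks what has to be stored, not what has to be computed: the input still passes through the block 12 times.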

Total parameters with different hyper-parameters

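The paper also compares the different sharing strategies, as shown in the table below. Even when all Encoder Block parameters are shared, performance is only slightly worse: the largest gap in the average score is only about 2%. Note, however, that for E = 768 the version without sharing performs best, while for E = 128 sharing only the Attention parameters performs best. The authors point this out but do not analyze the specific reason.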

The effect of cross-layer parameter-sharing strategies, ALBERT-base configuration

Method #3. Inter-Sentence Coherence Loss / Replacing NSP with SOP

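The original BERT has two loss functions, MLM and NSP. MLM randomly selects a fraction of the input tokens and masks them with [Mask], so the model learns to predict what the masked tokens were. NSP predicts whether the second of two sentences actually follows the first. But the NSP loss has a problem: when building a Negative Sample, Sentence A and Sentence B come from different documents, so the model does not really learn coherence between sentences. It only needs to notice that the topics differ to decide whether the two sentences are consecutive. Evidently, NSP adds little learning difficulty on top of MLM.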

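ALBERT's pretraining replaces NSP with Sentence-Order Prediction (SOP) to fix this problem. SOP builds a Positive Sample the same way NSP does, by taking two consecutive sentences from the same document, but it builds a Negative Sample by simply swapping the order of Sentence A and Sentence B in a Positive Sample, so the model has to learn to judge sentence order within a single document. SOP makes the task harder and lets the model capture the internal coherence between sentences of the same document. On downstream tasks, models pretrained with SOP outperform models pretrained with NSP, which may be the main reason ALBERT beats BERT despite having fewer parameters. In addition, in pretraining, a model trained with SOP handles the NSP task well, but a model trained with NSP cannot handle the SOP task: it reaches only 52%, which for Binary Classification is little better than a coin toss.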
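The difference between the two kinds of Negative Sample is easy to see in a toy example. This is a minimal sketch, not the authors' data pipeline: the function names, the label convention (1 = positive, 0 = negative), and the choice to emit one negative per positive are all assumptions made for illustration. A real pipeline works on token segments and typically samples positives and negatives with equal probability.

```python
import random


def make_sop_samples(document):
    """SOP: the negative is the same consecutive pair, with the order swapped."""
    samples = []
    for first, second in zip(document, document[1:]):
        samples.append((first, second, 1))  # original order
        samples.append((second, first, 0))  # swapped order, same document
    return samples


def make_nsp_samples(document, other_document, rng):
    """NSP: the negative pairs a sentence with one from a different document."""
    samples = []
    for first, second in zip(document, document[1:]):
        samples.append((first, second, 1))                      # true next sentence
        samples.append((first, rng.choice(other_document), 0))  # off-topic sentence
    return samples


doc_a = ["A1", "A2", "A3"]  # toy sentences from one document
doc_b = ["B1", "B2"]        # toy sentences from another document
rng = random.Random(0)      # seeded only to make the run repeatable

sop = make_sop_samples(doc_a)
nsp = make_nsp_samples(doc_a, doc_b, rng)
print(sop)
print(nsp)
```

In the SOP output, every negative uses exactly the same two sentences as its positive, so topic cues cannot separate them; in the NSP output, every negative contains a sentence from the other document.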

The effect of sentence-prediction loss, NSP vs. SOP, on intrinsic and downstream tasks.

BTW, ALBERT uses N-Gram masking for the MLM pretraining task, with N ≤ 3. It uses the same datasets as BERT, the LAMB Optimizer (lr = 0.00176), a Batch Size of 4096, and a total of 125,000 training steps.
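This post only states N ≤ 3. As far as I recall from the paper, the length n of each masked N-Gram is drawn with probability proportional to 1/n; treat that rule as an assumption to check against the paper. Under that assumption, the sketch below (with a hypothetical helper name) shows how often each length would be chosen.

```python
from fractions import Fraction


def ngram_length_probs(max_n=3):
    """Probability of each span length n, assuming p(n) is proportional to 1/n."""
    lengths = range(1, max_n + 1)
    weights = [Fraction(1, n) for n in lengths]
    total = sum(weights)  # 1 + 1/2 + 1/3 = 11/6 for max_n = 3
    return {n: w / total for n, w in zip(lengths, weights)}


probs = ngram_length_probs(3)
for n, p in probs.items():
    print(n, p, round(float(p), 3))
# 1 6/11 0.545
# 2 3/11 0.273
# 3 2/11 0.182
```

Under this rule single tokens are still masked most often, with two- and three-token spans making up a bit less than half of the selections.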

A summary comparison in terms of training speed and total parameter count:

Dev set results for models pretrained over BOOKCORPUS and Wikipedia for 125k steps

References

1. https://www.youtube.com/watch?v=lluMBz5AoOg
2. https://kknews.cc/code/gp6a3xm.html
3. https://zhuanlan.zhihu.com/p/88099919
