[Notes] (IJCAI2021) UNBERT: User-News Matching BERT for News Recommendation

Haren Lin
Paper Link


Observations & Ideas

  1. Existing recommendation methods learn textual representations only from in-domain news data, which limits their ability to generalize to new articles, a common situation in cold-start scenarios.
  2. Although pre-trained word embeddings (e.g., Word2Vec and GloVe) can alleviate the cold-start problem, they are context-independent, and their effectiveness may be further weakened when they are trained together with a randomly initialized downstream model.
  3. Many of these methods represent each user by aggregating the historically browsed news into a single vector and then compute a matching score against the candidate news vector, which may lose low-level matching signals.
  4. Moreover, treating each news article as a single vector and encoding users and items separately may ignore low-level matching signals (e.g., word-level relations) between users’ interests and candidate news.
A negative example: several news articles browsed by a user (upper box) and a candidate news article (lower box). Orange bars mark the important signals related to the green bar that should be captured.

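The candidate news is semantically very similar to browsed news #1 and #3, because all of them are about movies. At the word level, however, words such as “Florida” and “America” in the browsed news may not match “Japan” in the candidate news, and the user would not actually click on it. In other words, the word-level matching signal on location is not being put to good use for news recommendation.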


  1. To the best of our knowledge, UNBERT is the first work to introduce the pre-trained BERT to capture user-news matching signals for news recommendation that takes full advantage of the out-domain knowledge.
  2. UNBERT proposes representing the user directly by the raw text of the browsed news, and learns user-news matching representations at both the word level and the news level, capturing multi-grained user-news matching signals through two matching modules.
  3. Extensive experiments on a real-world dataset show that our approach can effectively improve the performance of news recommendation.


Inputs / Embedding Layers

UNBERT Input Representation

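The UNBERT input looks complicated, but once broken down it is simple and well defined. Each sample consists of a user's clicked news (the click history) and one candidate news article.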

[Step1] Tokenize the text source of every news article (the user's clicked news and the candidate news) to obtain the basic token embeddings. The layout mimics BERT's two-sentence input, [CLS] Sent1 [SEP] Sent2: Sent1 holds the candidate news tokens and Sent2 holds all of the user's clicked news. Because the click history consists of several articles, a newly defined [NSEP] (News SEParator) token is inserted between them.

[Step2] Obtain the segment embeddings. Following the [CLS] Sent1 [SEP] Sent2 layout, everything in the candidate news part, from [CLS] through [SEP], is labeled E_A, and all of the clicked news that follows is labeled E_B.

[Step3] Positional embeddings are not used in UNBERT, because the authors found in their experiments that they made performance worse.

[Step4] The news segment embedding marks which tokens belong to which news article.

The token, segment, and position embeddings are pre-trained with the masked LM objective: 15% of all WordPiece tokens in each sequence are masked at random and the model predicts the masked tokens. The news segment embedding is randomly initialized and updated further during fine-tuning. Finally, the UNBERT input representation is the sum of the token embedding, the segment embedding, and the news segment embedding.
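The four steps can be sketched in plain Python. This is only an illustration under stated assumptions: the function name build_unbert_input is hypothetical, tokens are assumed to be already split, the news-segment numbering (0 for [CLS], 1 for the candidate, 2 and up for clicked news) is made up for the example, and attaching each [NSEP] to the news that follows it is a guess rather than something the notes specify.

```python
def build_unbert_input(candidate, clicked, cls="[CLS]", sep="[SEP]", nsep="[NSEP]"):
    """Return (tokens, segment_ids, news_segment_ids) for one sample.

    candidate: list of tokens of the candidate news
    clicked:   list of token lists, one per clicked news
    """
    tokens, segment_ids, news_segment_ids = [], [], []

    def add(token, segment, news):
        tokens.append(token)
        segment_ids.append(segment)
        news_segment_ids.append(news)

    # Sent1: [CLS] candidate tokens [SEP], all in segment 0 (E_A).
    add(cls, 0, 0)
    for token in candidate:
        add(token, 0, 1)
    add(sep, 0, 1)

    # Sent2: clicked news separated by [NSEP], all in segment 1 (E_B).
    for k, news in enumerate(clicked):
        news_id = k + 2  # assumed numbering: clicked news start at 2
        if k > 0:
            add(nsep, 1, news_id)  # assumption: separator belongs to the next news
        for token in news:
            add(token, 1, news_id)

    return tokens, segment_ids, news_segment_ids


tokens, segment_ids, news_segment_ids = build_unbert_input(
    ["movie", "japan"], [["florida", "film"], ["america"]]
)
```

Each of the three id lists would then index its own embedding table, and the looked-up vectors are summed per position to form the input representation.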

Architecture (WLM + NLM)

UNBERT Model Architecture = WLM + NLM

Word-Level Module (WLM)

WLM applies multiple Transformer Layers (TL) iteratively to compute the hidden representation of each word (token) at each layer, while propagating matching signals at the word (token) level. In short, WLM derives each token's embedding from the interactions between words.

Transformer Layer (cf. Attention Is All You Need)
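For reference, the core operation inside each Transformer layer is the scaled dot-product attention from Attention Is All You Need. Here Q, K, and V are the query, key, and value projections of the token hidden states and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Because the candidate news tokens and the clicked news tokens sit in the same input sequence, these attention weights can relate a word in the candidate news directly to a word in the click history, which is one way to read the word-level matching signal described above.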

News-Level Module (NLM)

NLM aggregates the hidden representations of each news article's words from WLM into a news representation, and then applies another stack of Transformer layers to capture the news-level matching signal. In short, NLM pools the word embeddings produced by WLM to obtain an embedding for each news article.

  • w_i = the hidden representation of the i-th word obtained from WLM.
  • n_j = the representation of the j-th news, aggregated from its word sequence S_j, where i ∈ S_j.

Here, we obtain the news representation n_j from the word representations w_i using one of the following three aggregators:

  1. NSEP Aggregator: takes the [NSEP] token representation as the news embedding.
  2. Mean Aggregator: averages the word embeddings directly to form the news embedding.
  3. Attention Aggregator: uses a light-weight attention network to learn the combination weights over the word embedding matrix w.
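A minimal sketch of the three aggregators in plain Python, with word vectors as lists of floats. The function names are hypothetical. The attention variant scores each word by a dot product with a single query vector and applies a softmax; this is a simplification chosen for illustration, and the paper's light-weight attention network may be parameterized differently.

```python
import math


def nsep_aggregate(word_vecs, nsep_index):
    """NSEP aggregator: use the hidden state at the [NSEP] position."""
    return list(word_vecs[nsep_index])


def mean_aggregate(word_vecs):
    """Mean aggregator: element-wise average of the word vectors."""
    count = len(word_vecs)
    return [sum(column) / count for column in zip(*word_vecs)]


def attention_aggregate(word_vecs, query):
    """Attention aggregator (simplified): softmax-weighted sum of word vectors.

    query is an assumed learnable vector; scores are plain dot products.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in word_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vecs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, word_vecs)) for d in range(dim)]
```

With an all-zero query every word receives the same weight, so the attention aggregator reduces to the mean aggregator; a trained query lets it emphasize the more informative words instead.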

Click Predictor

Predicts the probability that a user clicks a candidate news article.

  • e_w = word-level matching representation (from WLM).
  • e_n = news-level matching representation (from NLM).

The two representation vectors are concatenated into a multi-grained matching representation, which is then fed to a fully connected layer.
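The concatenate-then-project step might look like the following sketch. The function name click_probability, the single-output layer, and the sigmoid are assumptions made for illustration; the notes only state that a fully connected layer follows the concatenation.

```python
import math


def click_probability(e_w, e_n, weights, bias=0.0):
    """Concatenate e_w and e_n, apply one linear layer, squash with a sigmoid."""
    features = list(e_w) + list(e_n)  # multi-grained matching representation
    if len(weights) != len(features):
        raise ValueError("weights must have len(e_w) + len(e_n) entries")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with made-up 2-dimensional representations and weights.
p = click_probability([0.2, -0.1], [0.4, 0.3], weights=[1.0, 1.0, 1.0, 1.0])
```

In a real model the weights and bias would be learned jointly with WLM and NLM during fine-tuning.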

P.S. Training Details: In our experiments, the MIND-small dataset is used to determine the parameter settings; we then train and evaluate on both the small and large datasets. bert-base-uncased is used as the pre-trained model to initialize the word-level module. We apply negative sampling with a ratio of 4, both for consistency with other baselines and for training efficiency. Adam is used for model optimization. The batch size is set to 128, the learning rate to 2e−5, and the model is trained for 2 epochs. All hyper-parameters are tuned on the validation set.
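Negative sampling with a ratio of 4 can be sketched as pairing every clicked news in an impression with four non-clicked news from the same impression. The function name and the data layout are hypothetical, and falling back to sampling with replacement when an impression has fewer than four negatives is an assumption; the notes do not say how that case is handled.

```python
import random


def sample_training_groups(impression, ratio=4, rng=None):
    """impression: list of (news_id, clicked) pairs.

    Returns a list of (positive_id, [negative_ids]) groups.
    """
    if rng is None:
        rng = random.Random()
    positives = [news for news, clicked in impression if clicked]
    negatives = [news for news, clicked in impression if not clicked]
    groups = []
    for positive in positives:
        if not negatives:
            break  # nothing to contrast against in this impression
        if len(negatives) >= ratio:
            sampled = rng.sample(negatives, ratio)  # without replacement
        else:
            # Assumed fallback: sample with replacement when negatives are scarce.
            sampled = [rng.choice(negatives) for _ in range(ratio)]
        groups.append((positive, sampled))
    return groups
```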


As for the dataset, it is of course the familiar MIND once again!

Statistics of the datasets


Overall Performance

The overall performance of different methods on MIND. Boldface indicates the best results (the higher, the better), while the second best is underlined. UNBERT-en△ represents the ensemble score based on UNBERT which is at the top of https://msnews.github.io/#leaderboard

Comparison of different aggregators in NLM

The performance ranking is as expected: NSEP < Mean < Attn aggregator.

Ablation Study of WLM and NLM

  1. UNBERT_word and UNBERT_news both achieve good performance, close to the full version of UNBERT, which confirms the effectiveness of both levels of matching signals.
  2. UNBERT_news outperforms UNBERT_word, which suggests that word-level matching alone is insufficient because it is weak at capturing news structure.
  3. The full version of UNBERT performs best, which indicates that the multi-grained matching signal is necessary for news recommendation.

Effectiveness on Cold Start

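This part examines how different news recommendation methods perform on unseen news. The authors split MIND-small by day into 7 groups, training and validating on the first three days and testing on the remaining four. As the figure shows, UNBERT consistently outperforms the other models. More notably, as time goes on and cold/unseen news becomes more common, UNBERT's performance does start to drop (especially from 11/13/2019 to 11/14/2019), but far less sharply than that of the other models, which to some extent demonstrates UNBERT's advantage in handling cold start.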
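The day-based protocol can be sketched as follows. This is an illustration only: the function names are hypothetical, each impression is assumed to carry an ISO date string (which sorts chronologically), and unseen_fraction is a made-up helper for measuring how much of a test day's news never appeared in training.

```python
from collections import defaultdict


def split_by_day(impressions, n_train_days=3):
    """impressions: list of (iso_date, record) pairs, e.g. ("2019-11-13", ...).

    Returns (train_records, {test_day: records}) using the earliest days for training.
    """
    by_day = defaultdict(list)
    for day, record in impressions:
        by_day[day].append(record)
    days = sorted(by_day)  # ISO date strings sort in chronological order
    train = [record for day in days[:n_train_days] for record in by_day[day]]
    test = {day: by_day[day] for day in days[n_train_days:]}
    return train, test


def unseen_fraction(train_news_ids, day_news_ids):
    """Share of distinct news on a test day that never appeared in training."""
    day_set = set(day_news_ids)
    if not day_set:
        return 0.0
    return len(day_set - set(train_news_ids)) / len(day_set)
```

Evaluating each test day separately, alongside its unseen fraction, makes it possible to see how performance changes as cold news accumulates.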

