[Notes] (IJCAI2021) UNBERT: User-News Matching BERT for News Recommendation

Haren Lin
Jun 17, 2022


These notes collect the key parts of the paper. I personally found its storytelling and structure very logical, and I recommend reading the original if you are interested!

Paper Link

https://www.ijcai.org/proceedings/2021/0462.pdf

Observations & Ideas

  1. Existing recommendation methods merely learn textual representations from in-domain news data, which limits their ability to generalize to new news articles, a common situation in cold-start scenarios.
  2. Although the cold-start problem can be alleviated with pre-trained word embeddings (e.g., Word2Vec and GloVe), these embeddings are context-independent, so their effectiveness may be further weakened when training with a randomly initialized downstream model.
  3. Many of these methods represent each user by aggregating the historically browsed news into a single vector and then compute a matching score against the candidate news vector, which may lose low-level matching signals.
  4. Moreover, treating each news article as a whole vector and encoding users and items separately may ignore some low-level matching signals (e.g., word-level relations) between a user's interests and the candidate news.
A negative example: several news articles browsed by a user (upper box) and a candidate news article (lower box). Orange bars mark the important signals, related to the green bar, that should be captured.

Browsed news #1 and #3 have strong semantic similarity with the candidate news, since they are all movie-related. At the word level, however, terms such as "Florida" and "America" in the browsed news "mismatch" with "Japan" in the candidate news, and the user would not actually click on it. In other words, for this user the word-level matching signal on location is not well exploited for news recommendation.

Contribution

  1. To the best of our knowledge, UNBERT is the first work to introduce pre-trained BERT to capture user-news matching signals for news recommendation, taking full advantage of out-of-domain knowledge.
  2. UNBERT represents a user directly by the raw text of the browsed news, and learns user-news matching representations at both the word level and the news level to capture multi-grained matching signals through two matching modules.
  3. Extensive experiments on a real-world dataset show that the approach can effectively improve news recommendation performance.

Model

Inputs / Embedding Layers

UNBERT Input Representation

The UNBERT input looks complicated, but it is actually quite simple once broken down. For each sample, we have a set of user news clicks and a candidate news article.

[Step1] Tokenize the text of all news (the user's clicked news and the candidate news) and obtain the basic token embeddings. UNBERT mimics BERT's two-sentence input format, [CLS] Sent1 [SEP] Sent2: Sent1 holds the candidate news tokens, and Sent2 holds all of the user's clicked news. Since the click history consists of several news articles, a newly defined [NSEP] (News SEParator) token is inserted between them.
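Below is a minimal sketch of this input construction, assuming the HuggingFace transformers tokenizer; the example titles and variable names are made up for illustration (they echo the figure above) and are not from the paper.

```python
# Sketch of UNBERT's Step 1: build [CLS] candidate [SEP] clicked news,
# with the new [NSEP] token separating the clicked news articles.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# [NSEP] is a newly defined special token, not part of the original BERT vocab.
tokenizer.add_special_tokens({"additional_special_tokens": ["[NSEP]"]})

candidate_title = "japan announces new movie release"        # hypothetical
clicked_titles = [                                           # hypothetical
    "florida film festival announces winners",
    "america votes for best picture of the year",
]

history_text = " [NSEP] ".join(clicked_titles)               # Sent2
tokens = (["[CLS]"] + tokenizer.tokenize(candidate_title) + ["[SEP]"]
          + tokenizer.tokenize(history_text))                # [CLS] Sent1 [SEP] Sent2
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens[:12])
```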

[Step2] Obtain the segment embeddings. Following [CLS] Sent1 [SEP] Sent2, the tokens from [CLS] through the [SEP] of the candidate news are labeled EA, and the remaining user-clicked news tokens are labeled EB.

[Step3] Position embeddings are not used in UNBERT, because the authors found experimentally that they made the results worse.

[Step4] News segment embeddings distinguish the token spans belonging to different news articles.

The token, segment, and position embeddings are pre-trained with masked LM, which randomly masks 15% of the WordPiece tokens in each sequence and predicts the masked tokens. The news segment embedding is randomly initialized and further updated during fine-tuning. Finally, summing the token embedding, segment embedding, and news segment embedding yields the UNBERT input representation.
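The embedding sum can be sketched as follows in PyTorch; the dimensions, the maximum number of news articles, and the choice to leave position embeddings out (per Step 3) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the UNBERT input representation: token + segment + news segment
# embeddings summed element-wise (no position embedding, per Step 3).
import torch
import torch.nn as nn

vocab_size, hidden, max_news = 30522, 768, 51      # hypothetical sizes

token_emb = nn.Embedding(vocab_size, hidden)       # comes pre-trained in BERT
segment_emb = nn.Embedding(2, hidden)              # E_A (candidate) / E_B (clicks)
news_segment_emb = nn.Embedding(max_news, hidden)  # randomly initialised, fine-tuned

input_ids = torch.randint(0, vocab_size, (1, 64))         # toy token ids
segment_ids = torch.zeros(1, 64, dtype=torch.long)        # 0 = E_A, 1 = E_B
news_segment_ids = torch.zeros(1, 64, dtype=torch.long)   # index of the news each token belongs to

x = token_emb(input_ids) + segment_emb(segment_ids) + news_segment_emb(news_segment_ids)
print(x.shape)  # torch.Size([1, 64, 768])
```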

Architecture (WLM + NLM)

UNBERT Model Architecture = WLM + NLM

Word-Level Module (WLM)

The WLM applies multiple Transformer Layers (TL) iteratively to compute hidden representations for each word (token) at every layer while simultaneously propagating the word(token)-level matching signal. In short, the WLM produces an embedding for every token through token-to-token interactions.

Transformer Layer (cf. Attention Is All You Need)
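As a rough illustration, the WLM can be thought of as a stack of Transformer encoder layers over the input representation x from the previous section; the sketch below uses PyTorch's generic encoder layer as a stand-in (in practice the module is initialized from bert-base-uncased, as noted in the training details later).

```python
# Sketch of the Word-Level Module: Transformer layers over token embeddings.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
wlm = nn.TransformerEncoder(encoder_layer, num_layers=12)  # 12 layers as in BERT-base

x = torch.randn(1, 64, 768)   # [batch, seq_len, hidden] input representation
word_hidden = wlm(x)          # contextualised token representations w_i
print(word_hidden.shape)      # torch.Size([1, 64, 768])
```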

News-Level Module (NLM)

The NLM aggregates the per-news word representations from the WLM into news representations, and then applies another stack of Transformer Layers to capture the news-level matching signal. In short, the NLM pools the word embeddings produced by the WLM to obtain an embedding for each news article.

  • wi = the hidden representation for the i-th word obtained from WLM.
  • nj = the j-th news representation aggregated from its sequence of words Sj where i ∈ Sj.

Here, the news representation nj is obtained from the word representations wi using one of the following three types of aggregators (a minimal sketch follows the list):

  1. NSEP Aggregator: takes the [NSEP] token representation as the news embedding.
  2. Mean Aggregator: averages the word embeddings directly to form the news embedding.
  3. Attention Aggregator: uses a light-weight attention network to learn combination weights over the word embedding matrix w.
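The three aggregators can be sketched as follows, assuming PyTorch; the token positions, mask, and the shape of the light-weight attention network are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the three NLM aggregators for a single news article j.
import torch
import torch.nn as nn

word_hidden = torch.randn(1, 64, 768)            # token representations from the WLM
news_mask = torch.zeros(1, 64, dtype=torch.bool)
news_mask[0, 10:20] = True                       # tokens that belong to news j (toy span)
nsep_pos = 20                                    # position of that news' [NSEP] token (toy)

# 1. NSEP aggregator: take the [NSEP] token representation.
n_nsep = word_hidden[:, nsep_pos]

# 2. Mean aggregator: average the word representations of news j.
words_j = word_hidden[news_mask]                 # [num_words_j, 768]
n_mean = words_j.mean(dim=0, keepdim=True)

# 3. Attention aggregator: a light-weight attention network learns the weights.
attn = nn.Sequential(nn.Linear(768, 128), nn.Tanh(), nn.Linear(128, 1))
weights = torch.softmax(attn(words_j), dim=0)    # [num_words_j, 1]
n_attn = (weights * words_j).sum(dim=0, keepdim=True)

print(n_nsep.shape, n_mean.shape, n_attn.shape)  # all torch.Size([1, 768])
```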

Click Predictor

Predicts the probability of a user clicking a candidate news article.

  • ew = word-level matching representation. (from WLM)
  • en = news-level matching representation. (from NLM)

These two representation vectors are concatenated into a multi-grained matching representation before applying a fully connected layer.
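A minimal sketch of this predictor, assuming PyTorch; the hidden size and the sigmoid output are assumptions about how the click probability is produced.

```python
# Sketch of the click predictor: concat(e_w, e_n) -> fully connected layer -> probability.
import torch
import torch.nn as nn

e_w = torch.randn(1, 768)   # word-level matching representation (from the WLM)
e_n = torch.randn(1, 768)   # news-level matching representation (from the NLM)

predictor = nn.Linear(768 * 2, 1)
score = predictor(torch.cat([e_w, e_n], dim=-1))
p_click = torch.sigmoid(score)   # probability that the user clicks the candidate news
print(p_click.item())
```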

P.S. Training details: In the experiments, the MIND-small dataset is used to determine the parameter settings, and the model is then trained and evaluated on both the small and large datasets. bert-base-uncased is used as the pre-trained model to initialize the word-level module. Negative sampling with a ratio of 4 is applied, both for consistency with other baselines and for training efficiency. Adam is used for optimization, the batch size is set to 128, the learning rate to 2e-5, and the model is trained for 2 epochs. All hyper-parameters are tuned on the validation set.
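A hedged sketch of this training setup is shown below; the listwise cross-entropy over 1 positive and 4 sampled negatives is an assumption consistent with the stated ratio, and unbert here is just a placeholder module.

```python
# Sketch of the reported training configuration (Adam, lr 2e-5, batch 128,
# negative sampling ratio 4, 2 epochs); `unbert` stands in for the real model.
import torch
import torch.nn as nn

unbert = nn.Linear(768 * 2, 1)                  # placeholder for the UNBERT network
optimizer = torch.optim.Adam(unbert.parameters(), lr=2e-5)
batch_size, neg_ratio, num_epochs = 128, 4, 2

features = torch.randn(batch_size, 1 + neg_ratio, 768 * 2)  # toy matching representations
labels = torch.zeros(batch_size, dtype=torch.long)          # the positive sits at index 0

scores = unbert(features).squeeze(-1)           # [batch, 1 + neg_ratio]
loss = nn.functional.cross_entropy(scores, labels)
loss.backward()
optimizer.step()
```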

Datasets

As for the dataset, it is of course the familiar MIND!

Statistics of the datasets

Experiments

Overall Performance

The overall performance of different methods on MIND. Boldface indicates the best results (the higher, the better), while the second best is underlined. UNBERT-en△ represents the ensemble score based on UNBERT which is at the top of https://msnews.github.io/#leaderboard

Comparison of different aggregators in NLM

The performance comparison is as expected: NSEP < Mean < Attention Aggregator.

Ablation Study of WLM and NLM

  1. UNBERT_word and UNBERT_news each achieve performance close to the full version of UNBERT, confirming the effectiveness of both levels of matching signals.
  2. UNBERT_news outperforms UNBERT_word, which suggests that word-level matching alone is insufficient because of its weakness in capturing news structure.
  3. The full version of UNBERT performs best, which shows that multi-grained matching signals are necessary for news recommendation.

Effectiveness on Cold Start

This part examines how different news recommendation methods perform on unseen news. The authors split MIND-small into 7 groups by day, training and validating on the first three days and testing on the remaining four. As shown in the figure above, UNBERT consistently outperforms the other models. Even more noteworthy, as time goes on and more cold/unseen news accumulates, UNBERT's performance does start to drop (especially from 11/13/2019 to 11/14/2019), but the decline is far milder than that of the other models, which to some extent demonstrates UNBERT's advantage in handling cold start.
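A hedged sketch of this day-wise split, assuming the impression log has been loaded into a pandas DataFrame with a "time" column; the column names and toy dates are assumptions, not the dataset's official schema.

```python
# Sketch of splitting MIND-small impressions into 7 day groups:
# first 3 days for train/validation, last 4 days for testing.
import pandas as pd

behaviors = pd.DataFrame({
    "impression_id": range(7),
    "time": pd.to_datetime([f"2019-11-{d:02d}" for d in range(9, 16)]),  # toy rows, one per day
})

days = sorted(behaviors["time"].dt.date.unique())                        # 7 distinct days
train_val = behaviors[behaviors["time"].dt.date.isin(days[:3])]          # first 3 days
test_by_day = {d: behaviors[behaviors["time"].dt.date == d] for d in days[3:]}  # last 4 days
print(len(train_val), [len(v) for v in test_by_day.values()])
```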

This article may be updated at any time. Thanks for reading! If you like the content, please hit the "clap" button. You can also press the follow button to keep track of new articles. Feel free to connect with me via LinkedIn or email.
