[Notes] (ACL2019) LSTUR: Neural News Recommendation with Long- and Short-term User Representations

Haren Lin
5 min read · Jun 19, 2022


If you work on news recommendation, you have almost certainly heard of this paper. This post is a quick, easy-to-follow introduction to the core ideas of LSTUR, to help you understand a model that was once state of the art!

Paper Link

https://nvagus.github.io/paper/ACL19NewsRec.pdf

Observation

Users usually have both long-term preferences and short-term interests. However, past news recommendation models usually learn a single representation per user, which may be insufficient.

Idea

  1. The interests of online users in news are very diverse. (Long Term Interest)
  2. User interests may evolve with time and may be triggered by specific contexts or temporal demands. (Short Term Interest)
  3. Users usually have both long-term preferences and short-term interests. However, existing news recommendation methods usually learn single representations of users, which may be insufficient.

As shown in the figure above, if a user is a fan of the Golden State Warriors, that user may read a lot of basketball news about this NBA team over several years. The authors call this kind of user preference Long-Term Interest. On the other hand, browsing news about the movie Bohemian Rhapsody may lead the user to read other related news, such as "Rami Malek Wins the 2019 Oscar", because Rami Malek is a key actor in that movie, even if the user has never read any news about Rami Malek before. The authors call this kind of user interest Short-Term Interest. From this example we can infer that both long-term and short-term interests matter for news recommendation.

Model Architecture

Part1. News Encoder (= Title + Topic Encoder)

News Encoder’s Title Encoder

News Encoder’s Title Encoder Cheat Sheet

[Step1] Convert the title tokens into word embeddings with pre-trained GloVe vectors.

[Step2] Apply a CNN over the word embeddings to capture local context and learn contextual word representations.

w[i−M:i+M] is the concatenation of the embeddings of words between positions i−M and i+M. C and b are the parameters of the convolutional filters in the CNN, and M is the window size.

[Step3] Apply word-level attention to pool the contextual word representations into the title representation.
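To make these three steps concrete, here is a minimal PyTorch sketch of the title encoder; the module names and dimensions are my own illustrative assumptions, not taken from any released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TitleEncoder(nn.Module):
    """GloVe word embeddings -> CNN over local windows -> word-level attention pooling."""
    def __init__(self, vocab_size, emb_dim=200, num_filters=300, window=3, attn_dim=200):
        super().__init__()
        # Step 1: word embeddings (initialized from GloVe in the paper)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Step 2: CNN capturing local context within a window of size `window`
        self.cnn = nn.Conv1d(emb_dim, num_filters, kernel_size=window, padding=window // 2)
        # Step 3: word-level attention for pooling
        self.attn_proj = nn.Linear(num_filters, attn_dim)
        self.attn_query = nn.Parameter(torch.randn(attn_dim))

    def forward(self, title_ids):                     # title_ids: (batch, title_len)
        w = self.embedding(title_ids)                 # (batch, len, emb_dim)
        c = F.relu(self.cnn(w.transpose(1, 2)))       # (batch, num_filters, len)
        c = c.transpose(1, 2)                         # (batch, len, num_filters)
        scores = torch.tanh(self.attn_proj(c)) @ self.attn_query  # (batch, len)
        alpha = torch.softmax(scores, dim=1)          # attention weights over words
        return (alpha.unsqueeze(-1) * c).sum(dim=1)   # (batch, num_filters) title vector
```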

News Encoder’s Topic Encoder

The authors also learn representations from the topic and subtopic of each news article. On MSN News, articles are usually labeled with a topic category (e.g., Sports) and a subtopic category (e.g., Golf) to help locate user interests. They reveal the general and detailed topics of the news and convey user preferences. The authors learn this information through ID embeddings:

Finally, the title, topic, and subtopic representations are concatenated to obtain the final news representation:

News Encoder = Title + Topic Encoder
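Continuing the sketch above (same imports), the full news encoder might look roughly like this; the topic embedding size of 100 is an assumption on my part:

```python
class NewsEncoder(nn.Module):
    """Final news representation = [title vector ; topic embedding ; subtopic embedding]."""
    def __init__(self, title_encoder, num_topics, num_subtopics, topic_dim=100):
        super().__init__()
        self.title_encoder = title_encoder
        self.topic_emb = nn.Embedding(num_topics, topic_dim)        # topic ID embedding
        self.subtopic_emb = nn.Embedding(num_subtopics, topic_dim)  # subtopic ID embedding

    def forward(self, title_ids, topic_id, subtopic_id):
        e_title = self.title_encoder(title_ids)    # (batch, num_filters)
        e_topic = self.topic_emb(topic_id)          # (batch, topic_dim)
        e_sub = self.subtopic_emb(subtopic_id)      # (batch, topic_dim)
        return torch.cat([e_title, e_topic, e_sub], dim=-1)  # final news representation
```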

Part2. User Encoder = LTUR + STUR

The user encoder learns representations of users from the history of their browsed news.

LTUR: capture the user’s consistent preferences.

  • Learned from the embeddings of the user IDs, which are randomly initialized and fine-tuned during model training.
  • Denoting u as the ID of a user and Wu as the look-up table for long-term user representations, the long-term user representation of this user is:
Long-term user representation
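In code, the LTUR boils down to an embedding lookup by user ID. A minimal sketch, continuing the same PyTorch setup (names are mine):

```python
class LongTermUserEncoder(nn.Module):
    """LTUR: a trainable embedding table W_u indexed by user ID."""
    def __init__(self, num_users, user_dim):
        super().__init__()
        # randomly initialized, fine-tuned during model training
        self.user_emb = nn.Embedding(num_users, user_dim)

    def forward(self, user_id):          # user_id: (batch,)
        return self.user_emb(user_id)    # (batch, user_dim) long-term user vector
```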

STUR: capture the user’s temporal interests.

  • Online users may have dynamic short-term interests in reading news articles, which may be influenced by specific contexts or temporal information demands.
  • Learn the short-term representations of users from their recent browsing history, and use a gated recurrent network (GRU) to capture sequential news reading patterns; the short-term user representation is the last hidden state of the GRU network.
P.S. The news encoder is applied to obtain the representations of these browsed articles, denoted as {e1, e2, ..., ek}. The formulas above are the GRU update equations: σ is the sigmoid function, ⊙ is the item-wise product, and Wr, Wz and Wh̃ are the parameters of the GRU network.
Short-term user representation
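Rather than writing out the GRU gate equations, a sketch can simply rely on PyTorch's built-in nn.GRU and take the last hidden state as the short-term user vector (dimensions are illustrative):

```python
class ShortTermUserEncoder(nn.Module):
    """STUR: a GRU over the k recently browsed news; the last hidden state is the user vector."""
    def __init__(self, news_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(news_dim, hidden_dim, batch_first=True)

    def forward(self, browsed_news, h0=None):
        # browsed_news: (batch, k, news_dim), the encoded recent clicks {e1, ..., ek}
        # h0: optional initial hidden state (used by LSTUR-ini, described in Part 3)
        _, h_last = self.gru(browsed_news, h0)   # h_last: (1, batch, hidden_dim)
        return h_last.squeeze(0)                 # (batch, hidden_dim) short-term user vector
```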

Part3. LSTUR = Combination of LTUR & STUR

1) LSTUR-ini: Using the LTUR to initialize the hidden state of the GRU network in the STUR model. In other words, the LTUR serves as the initial state of the STUR!

LSTUR-ini

2) LSTUR-con: Concatenating the LTUR with the STUR as the final user representation. In other words, the LTUR and the STUR are concatenated and used for the downstream prediction! (A sketch of both variants follows below.)

LSTUR-con
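A hedged sketch of both combination strategies, reusing the encoders above (function names are mine; LSTUR-ini assumes the LTUR dimension matches the GRU hidden size, as in the paper):

```python
def lstur_ini(ltur_vec, stur_encoder, browsed_news):
    # LSTUR-ini: the long-term vector initializes the GRU hidden state
    h0 = ltur_vec.unsqueeze(0)                    # (1, batch, hidden_dim)
    return stur_encoder(browsed_news, h0)

def lstur_con(ltur_vec, stur_encoder, browsed_news):
    # LSTUR-con: concatenate the long- and short-term vectors
    stur_vec = stur_encoder(browsed_news)
    return torch.cat([ltur_vec, stur_vec], dim=-1)
```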

Part4. Click Predictor

Use a simple dot product to compute the news click probability score.

probability score
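In code, the click predictor is essentially a one-liner, the inner product between the user vector and the candidate news vector (a sketch):

```python
def click_score(user_vec, news_vec):
    # s(u, e) = <u, e>: a higher score means a higher predicted click probability
    return (user_vec * news_vec).sum(dim=-1)   # (batch,)
```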

Training Details

Part1. Negative Sampling

Negative sampling is used during training. For each news article browsed by a user (a.k.a. the positive sample), we randomly sample K news articles from the same impression that the user did not click as negative samples. The model jointly predicts the click prediction scores of the positive news and the K negative news, so the news click prediction problem is reformulated as a pseudo (K + 1)-way classification task. The loss is then naturally designed to minimize the summation of the negative log-likelihood of all positive samples!

P is the number of positive training samples, and c^n_{i,k} is the k-th negative sample in the same session as the i-th positive sample.
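Under these definitions, the loss can be implemented as a standard (K + 1)-way cross-entropy in which the positive sample is always class 0. A sketch, continuing the PyTorch setup above:

```python
def lstur_loss(pos_score, neg_scores):
    # pos_score: (batch,) scores of the clicked news
    # neg_scores: (batch, K) scores of K non-clicked news from the same impression
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (batch, K + 1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # cross-entropy over (K + 1) classes = negative log-likelihood of the positive sample
    return F.cross_entropy(logits, labels)
```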

Part2. Bernoulli Masking

Not every user can be included when training the news recommendation model (e.g., new users), so it is inappropriate to assume at prediction time that every user has a long-term representation in our model. To handle this, during training the authors randomly mask a user's long-term representation with a certain probability p. When the long-term representation is masked, all of its dimensions are set to 0.

Thus, the long-term user representation in our LSTUR approach can be reformulated as:

B is the Bernoulli distribution, and M is a random variable that follows B(1, 1−p). The authors find in experiments that this training trick improves the performance of the approach.
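A sketch of the masking trick during training (the helper name is illustrative; the whole long-term vector is zeroed per user with probability p):

```python
def masked_ltur(user_emb, user_id, p=0.5, training=True):
    u = user_emb(user_id)                                   # (batch, user_dim)
    if training:
        # M ~ B(1, 1 - p): keep the vector with probability 1 - p, otherwise zero it out
        keep = torch.bernoulli(torch.full((u.size(0), 1), 1.0 - p, device=u.device))
        u = u * keep
    return u
```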

Experiments

The dataset consists of four weeks of MSN News data, from 2018-12-23 to 2019-01-19. The first three weeks are used for training, with 10% held out for validation, and the last week is used for testing. For each sample, the authors collect the browsing history of the previous 7 days to learn the short-term representation.

Statistics of the dataset in our experiments

P.S. Other training details from the paper: The word embedding dimension is 200. The number of filters in the CNN is 300, and the window size of the filters is set to 3. Dropout with rate 0.2 is applied to each layer to mitigate overfitting. The default long-term user representation masking probability p for model training is 0.5. Adam is used to optimize the model with a learning rate of 0.01. The batch size is set to 400, and the number of negative samples per positive sample is 4. These hyper-parameters were all selected according to the results on the validation set.
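For reference, the reported hyper-parameters collected in one place (the dict name is mine):

```python
# Hyper-parameters reported in the paper
LSTUR_HPARAMS = {
    "word_emb_dim": 200,
    "cnn_filters": 300,
    "cnn_window_size": 3,
    "dropout_rate": 0.2,
    "ltur_mask_prob_p": 0.5,
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "batch_size": 400,
    "negatives_per_positive_K": 4,
}
```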

The performance of different methods on news recommendation

In short, LSTUR-ini performs slightly better than LSTUR-con!

Ablation Study

Effectiveness of Long- and Short-Term User Representation & News Encoders in STUR

Left: the effectiveness of incorporating long-term user representations (LTUR) and short-term user representations (STUR). Right: comparisons of different methods for learning short-term user representations from recently browsed news articles.

Effectiveness of News Title Encoders

The comparisons of different methods for learning news title representations and the effectiveness of the attention mechanism in selecting important words.
  • Encoders using CNN outperform those using LSTM, since local contexts in news titles are more important for learning news representations.

Effectiveness of News Topic

The effectiveness of incorporating news topic and subtopic information for news recommendation.
  • Subtopics can provide more fine-grained topic information which is more helpful for news recommendation.

Influence of Masking Probability

The influence of mask probability p on the performance of LSTUR.

According to the figure above, LSTUR-ini and LSTUR-con show similar patterns. As p increases from 0, the performance of both methods improves. When p is too small, the model tends to overfit the LTUR, because the LTUR has a large number of parameters. When p is too large, the performance of both methods starts to decline, possibly because the useful information in the LTUR can no longer be effectively exploited. For both LSTUR-ini and LSTUR-con, a moderate choice of p (e.g., p = 0.5) balances the learning of the LTUR and the STUR.

This article may be updated from time to time! Thanks for reading. If you like the content, please click the "clap" button. You can also press the follow button to track new articles. Feel free to connect with me via LinkedIn or email.
