[Notes] (SIGIR2022) FUM: Fine-grained and Fast User Modeling for News Recommendation

May 17, 2022

Paper Link

https://arxiv.org/pdf/2204.04727.pdf

Observation

Existing methods (e.g., NRMS, NPA, LSTUR) usually first encode a user's clicked news into news embeddings independently and then aggregate them into a user embedding. However, these methods ignore the word-level interactions across different clicked news from the same user, which contain rich, detailed clues for inferring user interest.


Existing Methods (Left: NRMS; Right: NPA)

Idea

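Let's examine the authors' argument: for a single user, the words in different clicked news are related to each other. From the records in the table above, for example, Iron Man in the first news and Movies in the second suggest that this user likes movie-related content. We can also spot more latent preferences, such as music, from Adele in the fourth news and Songs in the fifth. Judging from these examples, the authors' claim seems quite reasonable, yet earlier algorithms such as the common baseline models (i.e., NPA, NRMS) do not seem to take this direction into account.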

Model Architecture

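Next, let's see how the authors handle this issue. Below is the overall architecture of the model. As usual, the model has a News Encoder, a User Encoder, and a Click Predictor. What differs is the User Encoder: in addition to the traditional approach, called the coarse-grained user model here, there is a fine-grained user model on the other side. The final user representation is the combination of the two sub-modules.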

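Before going into the model details, let's go over the notation. In short, each article gives us k textual features, such as titles and entities. Each textual feature contains many tokens, and we set its maximum length to a lowercase l. So lowercase t_{i,j} denotes the j-th token of the i-th textual sequence. For the click history of user u, we look at m records. With these features, we want to predict whether a user will click on a given candidate news.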
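To make the notation concrete, here is a minimal sketch of one user's click history as a nested m × k × l structure. The toy sizes, the `<pad>` token, the `pad_field` helper, and the example titles are all made up for illustration; they are not from the paper.

```python
# Hypothetical toy sizes: m clicked news, k textual fields per news, l tokens per field.
m, k, l = 3, 2, 4
PAD = "<pad>"  # assumed padding token

def pad_field(tokens, max_len=l):
    """Truncate or right-pad one textual field to exactly max_len tokens."""
    return (tokens + [PAD] * max_len)[:max_len]

# Made-up click history: each news has a title field and an entity field.
raw_history = [
    [["iron", "man", "sequel", "announced"], ["Iron_Man"]],
    [["best", "movies", "of", "the", "year"], ["Movies"]],
    [["adele", "releases", "new", "songs"], ["Adele"]],
]

# history[n][i][j] is token t_{i,j} of the n-th clicked news.
history = [[pad_field(field) for field in news] for news in raw_history]

# Flattening gives one long token sequence of length m * k * l.
flat_tokens = [tok for news in history for field in news for tok in field]
print(len(flat_tokens))  # 24
```

Padding every field to the same length l is what lets the whole history be treated as one fixed-size sequence later on.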

Part1. Fine-grained and Fast User Modeling

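Now for the highlight of the paper's model. As mentioned, FUM contains a coarse-grained part and a fine-grained part; this section covers Fine-grained and Fast User Modeling in detail. In short, it captures word-level interactions from the user's click records, both intra-news and inter-news, and derives user interests from them.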

Step 1: represent these textual sequences as embedding vectors. As mentioned, we have k textual sources, each with a maximum length of l (the letter L, not the digit 1), and m news records in total, so we get the word representations T of size (mkl) × d.

Step 2: concatenate the embeddings of all m records and attach the positional encodings to form the matrix H. Its shape is (L, g), where L = mkl and g = d + d (token_embed_dim + position_embed_dim).
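Steps 1 and 2 can be sketched roughly as follows. This is only a shape-level illustration in plain Python: the random tables, the toy sizes, and the choice of one position vector per flattened slot are my assumptions, not details confirmed by the paper.

```python
import random

random.seed(0)
m, k, l, d = 2, 2, 3, 4   # toy sizes (assumed)
L = m * k * l             # total number of tokens across all clicked news
vocab_size = 10           # hypothetical vocabulary size

def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(vocab_size, d)   # token embeddings
position_table = random_table(L, d)         # assumed: one position vector per slot

# Token ids of the flattened click history, length L.
token_ids = [random.randrange(vocab_size) for _ in range(L)]

# T has shape (L, d): one word representation per token.
T = [token_table[t] for t in token_ids]

# H has shape (L, g) with g = d + d: list "+" concatenates, it does not add.
H = [T[p] + position_table[p] for p in range(L)]

print(len(H), len(H[0]))  # 12 8
```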

Step 3: feed these tokens into Fastformer to model the relation between each token and the global context. The data is split into several heads for the transformation; each head takes h_i as input and outputs ĥ_i. Finally, the ĥ_i from all heads are concatenated to obtain g_i.
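For intuition, here is a simplified single-head sketch of Fastformer-style additive attention, based on my understanding of the idea: summarize all queries into one global query, mix it into the keys, summarize those into one global key, then mix that into the values. The function names and toy sizes are hypothetical, and I leave out the final linear layer and multi-head concatenation, so treat it as a sketch and not the paper's exact formulation.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(rows, W):
    """Multiply every row (length a) by the matrix W (a x b)."""
    return [[sum(r[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for r in rows]

def weighted_sum(weights, rows):
    return [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(len(rows[0]))]

def additive_attention_head(H, Wq, Wk, Wv, wq, wk):
    Q, K, V = project(H, Wq), project(H, Wk), project(H, Wv)
    scale = math.sqrt(len(wq))
    # 1) Summarize all queries into a single global query vector.
    alpha = softmax([dot(q, wq) / scale for q in Q])
    global_q = weighted_sum(alpha, Q)
    # 2) Mix the global query into each key element-wise, then summarize into a global key.
    P = [[a * b for a, b in zip(global_q, key)] for key in K]
    beta = softmax([dot(p, wk) / scale for p in P])
    global_k = weighted_sum(beta, P)
    # 3) Mix the global key into each value element-wise and add the query as a residual.
    return [[gk * v + q for gk, v, q in zip(global_k, val, qry)]
            for val, qry in zip(V, Q)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
seq_len, g, head_dim = 6, 4, 3   # toy sizes (assumed)
H = rand_matrix(seq_len, g)
out = additive_attention_head(
    H, rand_matrix(g, head_dim), rand_matrix(g, head_dim), rand_matrix(g, head_dim),
    rand_matrix(1, head_dim)[0], rand_matrix(1, head_dim)[0],
)
print(len(out), len(out[0]))  # 6 3
```

Every step is a single pass over the tokens, so the cost grows linearly with the sequence length, which matters when the input is the whole concatenated click history.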

Step 4: apply attention pooling over the tokens after the global context transformation to get the representation of each news (form news embeddings via word attention).

Step 5: use one attention layer to get the fine-grained user representation (u_f) (form the user embedding via news attention).
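Steps 4 and 5 are the same pooling operation applied at two levels: words to news, then news to user. Below is a minimal sketch, assuming a simplified scoring function (a query vector dotted with tanh of the input, with no learned projection matrix) and made-up toy vectors.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, query):
    """Score each vector with query . tanh(x), then return the softmax-weighted sum."""
    scores = [sum(q * math.tanh(x) for q, x in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[j] for w, vec in zip(weights, vectors)) for j in range(dim)]

# Toy input: 2 clicked news, each with 3 contextualized word vectors g_i of size 2.
user_words = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 0.0]],
]
word_query = [0.3, -0.2]   # hypothetical attention query for words
news_query = [0.1, 0.4]    # hypothetical attention query for news

# Step 4: words -> one vector per news.
news_vectors = [attention_pool(words, word_query) for words in user_words]
# Step 5: news vectors -> fine-grained user vector u_f.
u_f = attention_pool(news_vectors, news_query)
print(len(news_vectors), len(u_f))  # 2 2
```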

Part2. Coarse-grained User Modeling

Next is the coarse-grained part, which derives user interests from news-level interactions.

Step 1: the news encoder feeds each textual source into a Transformer to get a sequence-level representation per textual source, then combines them with an attention-weighted sum to get the representation of one news.

Step 2: feed these news representations into another Transformer to get the representations of the historical clicked news [c1, c2, …, cm], which serve as their contextualized embeddings.

Step 3: feed the contextualized embeddings [c1, c2, …, cm] into an attention network to get the coarse-grained user representation (u_c).
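Steps 2 and 3 of the coarse-grained part can be sketched as news-level self-attention followed by attention pooling. As a loud simplification, I replace the Transformer with a single scaled dot-product self-attention head that has no learned projections, and the news vectors and the pooling query are made up.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(X):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(X[0])
    contextualized = []
    for x in X:
        weights = softmax([dot(x, y) / math.sqrt(d) for y in X])
        contextualized.append([sum(w * y[j] for w, y in zip(weights, X)) for j in range(d)])
    return contextualized

def attention_pool(vectors, query):
    """Softmax over query-vector dot products, then a weighted sum."""
    weights = softmax([dot(query, vec) for vec in vectors])
    return [sum(w * vec[j] for w, vec in zip(weights, vectors))
            for j in range(len(vectors[0]))]

# Made-up news representations for m = 3 clicked news.
news_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]

C = self_attention(news_vectors)                 # [c1, c2, c3]
u_c = attention_pool(C, query=[0.2, 0.1, -0.3])  # coarse-grained user vector
print(len(C), len(u_c))  # 3 3
```

Note that each news is already a single vector here, so the interactions are only between whole news items, never between individual words of different news.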

Part3. News-User Matching

Sum the fine-grained user representation and the coarse-grained user representation obtained above to get the final user representation.

When a candidate news comes in, pass it through the news encoder from the coarse-grained part to get its candidate news representation, then take the inner product with the user representation to get the matching score (r).
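The matching step is small enough to write out directly. The helper names and the numbers below are hypothetical; only the operations (sum, then inner product) come from the description above.

```python
def user_vector(u_f, u_c):
    """Final user representation: element-wise sum of the two user vectors."""
    return [a + b for a, b in zip(u_f, u_c)]

def matching_score(u, n):
    """Inner product between the user vector and the candidate news vector."""
    return sum(a * b for a, b in zip(u, n))

u = user_vector([0.5, -1.0, 2.0], [0.5, 1.0, 0.0])   # [1.0, 0.0, 2.0]
r = matching_score(u, [2.0, 3.0, 0.5])               # 1*2 + 0*3 + 2*0.5
print(r)  # 3.0
```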

Finally, the model parameters are updated with the BPR loss until convergence.
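As a reminder of what BPR optimizes, here is a minimal sketch of the standard pairwise form, the mean of -log σ(r_pos - r_neg) over (clicked, non-clicked) pairs. How the paper samples negatives and averages over pairs is not covered in these notes, so the pairing below is an assumption.

```python
import math

def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(r_pos - r_neg)) over (clicked, non-clicked) pairs."""
    # -log(sigmoid(x)) equals log(1 + exp(-x)), computed here with log1p.
    losses = [math.log1p(math.exp(-(p - n))) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)

# Made-up matching scores: the loss is small when clicked news outrank non-clicked ones.
good_ranking = bpr_loss([3.0, 2.5], [1.0, 0.5])
bad_ranking = bpr_loss([1.0, 0.5], [3.0, 2.5])
print(good_ranking < bad_ranking)  # True
```

The loss only depends on the score difference, so it pushes clicked news above non-clicked ones without caring about the absolute score values.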

Experiments

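Now for the experiments. First, note that this paper was accepted at SIGIR 2022, and Table 1 below should make the reason obvious. It is worth noting, though, that it is not SOTA; plenty of methods perform better.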

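We can briefly compare model performance. Earlier baselines such as NPA, LSTUR, and NRMS all reach around 67% to 68% AUC, but as the authors point out, they can only capture news-level interactions to model user interests. FUM models everything from word-level up to news-level interactions, and its performance improves accordingly.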

Another point worth mentioning is speed. The authors state that although they bring in many textual sources and concatenate them into one very long document, the SOTA Fastformer architecture keeps both training and inference time at a reasonable level.

Ablation Study

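Finally, the authors run an ablation study to compare performance and validate their ideas. Look at the left figure first: on all three metrics, performance drops once the fine-grained model is removed. This is because the fine-grained (word-level) interactions across different clicked news of the same user usually contain rich clues about the user's interests, and fine-grained user modeling captures these word-level interactions effectively and models user interests better. If the coarse-grained model is removed instead, performance drops even more, because news-level behavior interactions are inherently very important for user modeling in news recommendation.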

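Another key point: using only the coarse-grained model actually works better than using only the fine-grained model. After all, the fine-grained model is the icing on the cake. Its purpose is to capture word-level interactions across clicked news, and on its own it is less suited to learning news-level behavior interactions.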

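Finally, in the right figure, the authors plug different Transformers into FUM to observe their effect. Besides Fastformer, they also apply two other SOTA Transformers (i.e., Longformer and Poolingformer). The chart shows two things. First, FUM with any of these Transformers consistently outperforms the baseline models, which validates the importance of fine-grained user modeling. Second, Fastformer makes FUM significantly more efficient than the other Transformers. The authors therefore choose Fastformer for the fine-grained user modeling in FUM.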
