[Notes] (SIGIR2022) MM-Rec: Multimodal News Recommendation

Jun 20, 2022

Heads up: this paper combines Computer Vision, Natural Language Processing, and Recommender Systems! CV + NLP + RecSys!

Paper Link

https://arxiv.org/pdf/2104.07407.pdf

Observation & Idea

Users may click on news not only because they are interested in the content of the news title, but also because they are drawn to the news image.

News with images for news recommendation

News titles and images usually have some relatedness in describing news content and attracting clicks. E.g., in the figure above, "Cowboy" in the title of the 2nd clicked news is related to the players shown in its image. Modeling this relatedness can help with news modeling and with inferring user interests. (=> Multimodal news encoder)
A user may have multiple interests, and a candidate news may only be related to a specific interest encoded in part of clicked news. E.g., in the figure above, the candidate news is related only to the 2nd clicked news. Modeling the relatedness between clicked news and candidate news can therefore help predict the user's specific interest in the candidate news. (=> Crossmodal Candidate-aware Attention)
Candidate news may have crossmodal relatedness with clicked news. E.g., in the figure above, the image of the candidate news is related to both the image and the title of the 2nd clicked news, because both images show the same team and its name is mentioned in the title of the 2nd clicked news. Modeling the cross-modal relations between candidate news and clicked news can help measure their relatedness accurately. (=> Crossmodal Candidate-aware Attention)

Model Architecture

Part1. Multimodal News Encoder

Since different regions of a news image may have different informativeness for news modeling, the authors first use a pre-trained object detection model, Mask R-CNN, to extract the Regions of Interest (ROIs) of the news image. They then use ResNet-50 to extract features for the ROIs.

News image feature sequence, where 𝐾 is the number of ROIs.

Next comes the modeling of the textual source, for which the authors use only the news title. The news title is first tokenized into a word sequence.

Word Sequence, where 𝑀 is the number of words.

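An intuitive approach is to model the news title and image independently with separate models. However, the title and image of the same news usually have some relation to each other, and capturing the relatedness between them helps in understanding the content and inferring user interests. Visiolinguistic models are very effective for cross-modal tasks like this, so the authors apply ViLBERT to capture the inherent relatedness between news title and image, which benefits the learning of news representations.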

The inputs to ViLBERT are the ROI news image feature sequence and the word sequence described above. They first pass through several vanilla Transformers that model the contexts, and then through several co-attention Transformers that capture the cross-modal interactions between news image and title. The outputs are the hidden ROI representation sequence Hp and the hidden word representation sequence Ht.
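To make the co-attention idea concrete, here is a minimal sketch in plain Python. It only illustrates the core pattern in which each modality's vectors act as queries over the other modality's vectors, using scaled dot-product attention. It is not ViLBERT itself: learned projections, multiple heads, residual connections, and feed-forward layers are all omitted, and the names `cross_attend`, `words`, and `rois` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query vector attends over the context vectors.

    The context vectors serve as both keys and values (an assumption
    made to keep the sketch free of learned projection matrices).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return outputs

# Toy inputs: 2 word vectors and 3 ROI vectors, all 2-dimensional.
words = [[1.0, 0.0], [0.0, 1.0]]
rois = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

words_ctx = cross_attend(words, rois)   # words enriched with image context
rois_ctx = cross_attend(rois, words)    # ROIs enriched with title context
```

Each output sequence keeps the length of its query sequence, which is why the encoder can still emit one hidden vector per word and one per ROI after the cross-modal exchange.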

With the hidden representations of the news title and news image in hand, a word attention network and an image attention network are applied to them respectively, to learn the news title representation and the news image representation.

Image attention network. qp is an attention query vector and Wp is a parameter matrix

Word attention network. qt is an attention query vector and Wt is a parameter matrix

Finally, attention pooling yields the news image representation rp and the news title representation rt.
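The attention pooling step can be sketched as follows. This is a minimal illustration that assumes the common additive form, where each hidden vector h is scored as q · tanh(W h) and the scores are normalized with a softmax. The notes above only state that q is a query vector and W is a parameter matrix, so the tanh and the toy values of `H_t`, `W_t`, and `q_t` are assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, W, q):
    """Pool a sequence of hidden vectors into one vector.

    hidden: list of vectors (one per word or ROI)
    W:      parameter matrix, given as a list of rows
    q:      attention query vector, one entry per row of W
    """
    scores = []
    for h in hidden:
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(qi * pi for qi, pi in zip(q, projected)))
    weights = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights

# Toy example: 3 hidden word vectors of dimension 2 (illustrative values only).
H_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_t = [[1.0, 0.0], [0.0, 1.0]]
q_t = [1.0, 1.0]

r_t, alpha_t = attention_pool(H_t, W_t, q_t)
```

The same function would be applied to Hp with its own Wp and qp to obtain rp. In this toy case the third vector gets the largest weight because it aligns best with the query.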

Part2. Multimodal News Recommendation

Representations of previously clicked news learned from their titles, where 𝑃 is the number of clicked news

Representations of previously clicked news learned from their images, where 𝑃 is the number of clicked news

Selecting clicked news according to their relevance to candidate news in user modeling may help accurately match candidate news with user interest.
Candidate news may have some cross-modal relations with the images and titles of clicked news.

Motivated by the two points above, the authors propose a cross-modal candidate-aware attention network.

image and text representations of the candidate news Dc

a) Text-text attention weights for clicked news.

b) Text-image attention weights for clicked news.

c) Image-text attention weights for clicked news.

d) Image-image attention weights for clicked news.

These four attention weights are combined to obtain the unified user embedding u:

Unified user embedding
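The four attention weights and the unified user embedding can be sketched as below. This is only an illustration under stated assumptions: relevance is scored with a plain dot product between a clicked-news vector and a candidate vector, the scores are normalized with a softmax over the clicked news, and the user embedding is the sum of the clicked title and image vectors weighted by the attention weights that apply to them. The paper's exact parameterization may differ, and the mapping of the names "text-image" and "image-text" to (clicked, candidate) pairs is my own reading.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def candidate_aware_weights(clicked, candidate):
    # One weight per clicked news, based on its relevance to the candidate vector.
    return softmax([dot(r, candidate) for r in clicked])

def user_embedding(clicked_text, clicked_image, cand_text, cand_image):
    # Assumed naming: first word = clicked-news modality, second = candidate modality.
    a_tt = candidate_aware_weights(clicked_text, cand_text)    # text-text
    a_tp = candidate_aware_weights(clicked_text, cand_image)   # text-image
    a_pt = candidate_aware_weights(clicked_image, cand_text)   # image-text
    a_pp = candidate_aware_weights(clicked_image, cand_image)  # image-image
    dim = len(cand_text)
    u = [0.0] * dim
    for i in range(len(clicked_text)):
        for d in range(dim):
            u[d] += (a_tt[i] + a_tp[i]) * clicked_text[i][d]
            u[d] += (a_pt[i] + a_pp[i]) * clicked_image[i][d]
    return u

# Toy example: 2 clicked news; the candidate matches the first one in both modalities.
clicked_text = [[1.0, 0.0], [0.0, 1.0]]
clicked_image = [[1.0, 0.0], [0.0, 1.0]]
cand_text = [1.0, 0.0]
cand_image = [1.0, 0.0]

u = user_embedding(clicked_text, clicked_image, cand_text, cand_image)
```

Because the weights depend on the candidate, the same click history produces a different user embedding for each candidate news, which is what lets the model focus on the specific interest that the candidate relates to.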

Part3. Click Score

Click Prediction Score

P.S. Trained with negative sampling & cross-entropy loss.
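As a rough sketch of what negative sampling with cross-entropy loss typically looks like in news recommendation, the clicked news is scored together with a few sampled non-clicked news, and the loss is the negative log of the softmax probability assigned to the clicked one. The function name, the number of negatives, and the scores below are hypothetical, and the paper's exact sampling scheme is not reproduced here.

```python
import math

def negative_sampling_loss(pos_score, neg_scores):
    """Cross-entropy over the softmax of one positive and several negative scores.

    The positive (clicked) news is treated as the target class.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    # log-sum-exp computed in a numerically stable way
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - pos_score

# The loss is small when the clicked news outscores the sampled negatives...
loss_good = negative_sampling_loss(5.0, [0.0, 0.0, 0.0, 0.0])
# ...and large when a negative outscores it.
loss_bad = negative_sampling_loss(0.0, [5.0, 0.0, 0.0, 0.0])
```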

Datasets

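There is currently no public dataset that can serve as a benchmark for news recommendation that combines image and text data, so the authors collected three weeks of logs from a commercial news website, from 2020.02.25 to 2020.03.16. The first week of data is used for user histories and the last two weeks for click samples. The first 1M samples are used for training, and 100K samples each are then taken for validation and testing.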

P.S. Training details from the paper: In our experiments, we finetuned the last three layers of ViLBERT. We used Adam as the optimizer (lr=1e-5). The batch size was 32. We tuned hyperparameters on the validation set. We repeated each experiment 5 times and reported the average AUC, MRR, NDCG@5 and NDCG@10 scores.
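For readers unfamiliar with the reported metrics, here is a minimal sketch of AUC, MRR, and NDCG@k for a single impression with binary click labels. It assumes the convention commonly used in news recommendation evaluation, where MRR averages the reciprocal ranks of all clicked items. The paper's own evaluation script is not shown in these notes, so treat the details as assumptions, and the labels and scores below are made up.

```python
import math

def auc(labels, scores):
    # Fraction of (clicked, non-clicked) pairs ranked in the right order; ties count 0.5.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ranked_labels(labels, scores):
    # Labels reordered by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def mrr(labels, scores):
    # Mean of the reciprocal ranks of the clicked items.
    ranked = ranked_labels(labels, scores)
    return sum(l / (rank + 1) for rank, l in enumerate(ranked)) / sum(labels)

def dcg_at_k(ranked, k):
    return sum((2 ** l - 1) / math.log2(rank + 2) for rank, l in enumerate(ranked[:k]))

def ndcg_at_k(labels, scores, k):
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked_labels(labels, scores), k) / ideal

# One toy impression: 5 candidates, 2 of them clicked.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.3]

auc_value = auc(labels, scores)
mrr_value = mrr(labels, scores)
ndcg5_value = ndcg_at_k(labels, scores, 5)
```

In practice these per-impression values would be averaged over all test impressions.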

Experiments

Performance comparison of different methods

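For the non-MM-Rec models above, the authors use BERT as the news encoder. These models use only text data and do not use images as an additional feature. The numbers clearly show that introducing image data helps news recommendation prediction (t-test p < 0.01).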

Ablation Study

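Both the news title and the image are useful for learning news representations for recommendation. Interestingly, even when MM-Rec does not use the image module in training, it still outperforms other baselines such as PLM-NR, because ViLBERT is pre-trained on multimodal data and can exploit visual signals to enhance text understanding. In addition, combining multimodal news information further improves recommendation performance, which indicates that it helps in learning accurate news representations.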

Effect of the co-attentional Transformers in ViL-BERT and cross-modal candidate-aware attention

The authors also study the effectiveness of ViLBERT's co-attention Transformers and of the cross-modal candidate-aware attention network in modeling user interests. They design two variants: MM-Rec without the co-attention Transformers, and MM-Rec with the cross-modal candidate-aware attention network replaced by vanilla attention. As the figure above shows, MM-Rec with both components performs best. News titles and images have an inherent hidden relation in representing news content and attracting clicks, so modeling their interaction can enhance their representations. The cross-modal candidate-aware attention network is also useful, because different historical clicked news usually have different importance for user interests.

Ablation studies on different embeddings

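The authors find that removing any candidate news embedding or user embedding degrades performance, which shows that both textual and visual information are useful for modeling news and user interests. Textual information also plays the more important role. This is an interesting phenomenon, because textual sources are usually less eye-catching than images. The authors attribute it mainly to the fact that a single news image usually cannot comprehensively summarize the news content, so understanding the whole news through visual information alone can be challenging.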

Case Study

The clicked news of a user and the rankings of candidate news given by NRMS and MM-Rec. Only the first candidate news is clicked.

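The figure above shows that both NRMS and MM-Rec give the last candidate news a low rank, because its title makes it easy to infer that it is unrelated to the user's interests. However, NRMS fails to recommend the first candidate news to the user. This news is highly related to the user's second clicked news, which is about the NFL, but their relatedness is hard to measure from the titles alone. In contrast, MM-Rec ranks the first candidate news at the top, because it is easy to match with the user's interest based on visual information. These results demonstrate the effectiveness of multimodal information in news recommendation.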

This article may be updated at any time. Thanks for reading! If you like the content, please click the "clap" button. You can also press the follow button to track new articles. Feel free to connect with me via LinkedIn or email.