[Notes] (SIGIR2022) Positive, Negative and Neutral: Modeling Implicit Feedback in Session-based News Recommendation

Haren Lin
11 min read · Jun 24, 2022


This paper is seriously impressive! It breaks away from the usual sequential-recommendation framing and tackles news recommendation directly as a session-based problem!

Paper Link

https://arxiv.org/pdf/2205.06058.pdf

Source Code

Observation & Idea

  • Previous works tend to formulate session-based recommendation as a next item prediction task, while neglecting the implicit feedback from user behaviors, which indicates what users really like or dislike.
  • The authors propose a comprehensive framework to model user behaviors through positive feedback (i.e., the articles they spend more time on) and negative feedback (i.e., the articles they choose to skip without clicking). Moreover, the framework implicitly models the user via their session start time, and the article via its initial publishing time, in what the authors call “neutral feedback”.

Introduction

On BBC / CNN / Bing News, many people visit anonymously or log in as guests. They generally do not read many articles within a short period, which leads to limited interactions, so it is hard for the system to fully understand these users' behavior.

  • Traditional methods: traditional approaches reduce news rec to a CTR prediction task, mainly using CF or FM to let the system track the user history. The drawback is that they cannot recommend to anonymous visits & guest logins, which have no user history.
  • Recent NN methods: more recent methods (e.g. LSTUR, NRMS, NPA) focus on using attention mechanisms to encode the news representation & user representation, and then treat news rec as sequential rec. The drawback is that they ignore click behavior & article-to-article transitions, e.g. they do not fully exploit the temporal information associated with reading behavior, which is especially important when user interactions are sparse.

Considering these issues, the authors tackle news rec as a session-based recommendation problem! The final goal is still to predict the next news item the user will be interested in, but the model makes its decision based on the previous sequence of behaviors within a session, where a session is usually a time span we define ourselves (e.g. 30 min). In addition, the authors propose modeling the “implicit feedback from user behaviors,” as described below.

In session-based news reading, a user may spend different amounts of time on different clicked articles, representing different levels of preference in an implicit form of positive feedback; a user's impression of an article without an eventual click on the article represents an implicit form of negative feedback; the start time of the click and publishing time of articles can be viewed as neutral feedback.

Typical implicit feedback can be extracted from browsing the main page, reading an article, closing an article, backtracking, etc. The authors argue that modeling such implicit feedback on top of the explicit feedback (clicks) yields better recommendation quality for the user. Accordingly, the following three questions are the three aspects the authors focus on in this paper.

  • If a user clicked an article, did she really like it?
  • If a user did not click an article, did she dislike it?
  • How do we model the temporal characteristics of the user and the articles in the system?
  1. In traditional recsys, clicks usually represent a user's like / vote, but for news rec the situation is a bit different: a user may be tricked into clicking an article they have no interest in, and once the user realizes it, they quickly exit and switch to another article. Therefore, the time a user spends reading an article (active time on the article) is more fine-grained information for representing the user's degree of preference for the article! Moreover, time spent on an article is a continuous value, while click-or-not is binary.
  2. An article a user did not click does not mean they dislike it; they may simply never have been exposed to it. Assuming articles are presented to the user roughly in the order of publication time, we can infer which articles were likely to have made an impression on the user within a given session. Only articles that appear in the user's impression list but were not clicked are considered uninteresting to the user; the authors treat this information as implicit negative feedback.
  3. Although positive and negative feedback help estimate the interaction between a user & news article, some key temporal information is also useful for modeling individual users and articles. The start time of a session can characterize the user's daily routine, with the expectation that they show the same reading behavior or reading background at the same day of a week / same time of a day. On the other hand, the publication times of the articles in a session also form a sequence, which reflects the user's sensitivity to article recency. Therefore, the authors treat the session start time & article publishing time as implicit neutral feedback.

Recap: Session Based Model

Given the prefix sequence of the session, Su = (n1, n2, n3, …, nT), predict the item the user is most likely to click at the next time step T+1, nT+1. Item embeddings come from an N x dn item embedding matrix, where dn is the dimension of the item embedding and xi is the embedding vector of item ni. With the embedding vectors in hand, the sequence can be aggregated with a GRU / RNN / attention to obtain xs, representing the user's history preference. Then each of the candidate item embeddings [x1, x2, x3, …, xN] is compared with xs via inner product, followed by a softmax, to obtain the similarity scores.

Finally, the loss is usually the cross-entropy loss:

S is the set of all training sessions, Su is one session in S, and N is the number of candidate items. yu_j = 1 means article j is the nT+1 of Su; yu_j = 0 otherwise.
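The equation image did not survive the export; based on the caption above, the standard session-based cross-entropy loss should take this form (a reconstruction, not copied verbatim from the paper):

```latex
\mathcal{L}_1 = -\sum_{S_u \in S} \sum_{j=1}^{N} y^{u}_{j}\,\log \hat{y}^{u}_{j},
\qquad
\hat{y}^{u}_{j} = \frac{\exp(\mathbf{x}_s^{\top}\mathbf{x}_j)}{\sum_{k=1}^{N}\exp(\mathbf{x}_s^{\top}\mathbf{x}_k)}
```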

Proposed Model Architecture

Notations and descriptions
Model architecture. Squares in the figure represent vectors, and their colors refer to the different encoders that produce them.

Part1. Base Model: Content-aware Recommendation (CAR)

First, the paper's base model, CAR. For the news articles' content information, the model takes the news title & metadata attributes of articles (e.g. categories) and converts this textual information into a dc-dimensional vector via Word2Vec. Then, with the content vector ci of article ni and its item embedding vector xi, concatenating [xi; ci] gives the embedding vector of ni, xci. After converting the click sequence into [xc1, xc2, …, xcT], attention pooling is used to encode the user's history preference XCs.

W0: 1 x dn; W1: dn x (dn+dc); xci: (dn+dc) x 1; b0: dn x 1
XCs = Contextual session vector
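The equation images are missing here; a reconstruction of the additive attention pooling consistent with the dimension annotations above (hedged; details such as where the positional encoding enters may differ from the paper):

```latex
\alpha_i = \mathbf{W}_0 \tanh\!\left(\mathbf{W}_1 \mathbf{x}^{c}_{i} + \mathbf{b}_0\right),
\qquad
\alpha'_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{T}\exp(\alpha_j)},
\qquad
\mathbf{X}^{c}_{s} = \sum_{i=1}^{T} \alpha'_i\, \mathbf{x}^{c}_{i}
```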

One thing to note: to capture the sequential information of the input sequence, positional encodings (as in the Transformer) are added to Su first, and only then is the attention applied to obtain XCs.

P.S. This part is straightforward and mirrors typical neural news rec: represent every clicked news item with its news embedding xci, obtain the user embedding XCs through a simple attention, and then compute the inner product with each candidate news item to get its score.
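To make the pattern concrete, here is a minimal PyTorch sketch of this kind of additive attention pooling (my own illustration, not the authors' code; layer names and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Additive attention pooling over a session of clicked-article embeddings."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_hidden)            # plays the role of W1, b0
        self.score = nn.Linear(d_hidden, 1, bias=False)  # plays the role of W0

    def forward(self, x):
        # x: (batch, T, d_in) -- one row per clicked article xci
        a = self.score(torch.tanh(self.proj(x)))   # (batch, T, 1) raw scores alpha_i
        w = torch.softmax(a, dim=1)                # normalize over the session
        return (w * x).sum(dim=1)                  # (batch, d_in) session vector XCs

# Example: a batch of 2 sessions, T = 5 clicks, d_n + d_c = 128, hidden size d_n = 64
pool = AttentionPooling(d_in=128, d_hidden=64)
user_vec = pool(torch.randn(2, 5, 128))            # shape: (2, 128)
```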

Part2. Modeling Time as Neutral Feedback

2–1. Active Time (Duration Time)

The authors convert the continuous active time ti into a discrete ti' with a floor function, mapping it to one of m categories / distinct time values; each category shares the same active time embedding vector tai (dt x 1).
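A minimal sketch of this discretization (the bucket width, m and dt are my assumptions for illustration; the paper only specifies a floor-based mapping into m shared categories):

```python
import torch
import torch.nn as nn

m = 50               # number of active-time categories (assumed)
bucket_seconds = 10  # bucket width in seconds (assumed)
d_t = 16             # active-time embedding dimension (assumed)
active_time_emb = nn.Embedding(m, d_t)  # each category shares one embedding tai

def active_time_embedding(t_seconds: float) -> torch.Tensor:
    """Floor the continuous active time into a bucket and look up its embedding."""
    t_discrete = min(int(t_seconds // bucket_seconds), m - 1)  # floor + clip to m buckets
    return active_time_emb(torch.tensor(t_discrete))           # shape: (d_t,)

print(active_time_embedding(47.3).shape)  # torch.Size([16])
```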

2–2. Click Start Time (for XCs)

  • Users who start reading at a similar time are more likely to share the same reading behavior, which means that user interests are influenced by the start time. E.g. some people tend to read financial news in the morning but read entertainment news in the evening.

The click time of each click behavior in a session is recorded as tsi, using only the (w, h) week & hour fields of the temporal data.

Embedding of click start time, i from 1 to |Su| = T

To model the different importance of each tsi within Su, an attention mechanism is used again: first a preference query is obtained (this part works like the Transformer query), and then the preference query attends over the articles to get the importance of their mutual interaction.

Preference query, i from 1 to |Su| = T. Wt = dn x 2dt; tsi = 2dt x 1; bt = dn x 1; qi = dn x 1.
Importance of interactions between article and preference query generated from click start time. ci here should read xci (I believe this is a typo by the authors) = (dc+dn) x 1; Wt' = (dc+dn) x dn; qi = dn x 1; αti = (dc+dn) x 1.
Get weighting parameters via softmax. |Su| = T.

Finally, the newly obtained αti' is combined with αi' from the Part 1 base model to redefine the contextual vector representation XCs:

Redefined contextual vector representation. XCs = (dn+dc) x 1.
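Since the equation images are missing, here is one plausible reconstruction consistent with the dimension annotations (hedged — in particular, how αti' is combined with αi' is my reading; consult the paper for the exact form):

```latex
\mathbf{t}^{s}_{i} = [\mathbf{t}^{w}_{i} ; \mathbf{t}^{h}_{i}] \in \mathbb{R}^{2d_t},
\qquad
\mathbf{q}_i = \tanh\!\left(\mathbf{W}_t\, \mathbf{t}^{s}_{i} + \mathbf{b}_t\right)
\alpha^{t}_{i} = \mathbf{x}^{c}_{i} \odot \left(\mathbf{W}'_t\, \mathbf{q}_i\right),
\qquad
\alpha^{t\prime}_{i} = \operatorname{softmax}_i\!\left(\alpha^{t}_{i}\right),
\qquad
\mathbf{X}^{c}_{s} = \sum_{i=1}^{T} \alpha'_i \left(\alpha^{t\prime}_{i} \odot \mathbf{x}^{c}_{i}\right)
```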

2–3. Publish Time (for XTs)

  • Users' reading habits are reflected in the sequence of publishing times tp1, …, tpi in Su. From this we can infer whether the user tends to browse new articles or older ones.

The publishing time embedding vector is recorded as tpi, using the (s, d, w, h, m) season, day, week, hour, and minute fields of the temporal data.

Publishing time embedding vector, i from 1 to |Su| = T

Next, compute the attention score αtp'_i between the article content vector (ci) and the publishing time embedding vector (tpi):

W0' = 1 x dn; W1' = dn x 5dt; tpi = 5dt x 1; W2' = dn x dt (note: if W2' multiplies ci, this should presumably be dn x dc); ci = dc x 1; b0' = dn x 1.
αtp’_i = 1 x 1

Finally, the attention scores αtp'_i are used to take a weighted sum of the publishing time embedding vectors tpi, giving xts:

Final temporal session representation. tpi = 5dt x 1; xts = 5dt x 1.
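A reconstruction of this step from the dimension annotations (hedged):

```latex
\mathbf{t}^{p}_{i} = [\mathbf{t}^{p,s}_{i} ; \mathbf{t}^{p,d}_{i} ; \mathbf{t}^{p,w}_{i} ; \mathbf{t}^{p,h}_{i} ; \mathbf{t}^{p,m}_{i}] \in \mathbb{R}^{5d_t},
\qquad
\alpha^{tp}_{i} = \mathbf{W}'_0 \tanh\!\left(\mathbf{W}'_1 \mathbf{t}^{p}_{i} + \mathbf{W}'_2\, \mathbf{c}_i + \mathbf{b}'_0\right)
\alpha^{tp\prime}_{i} = \operatorname{softmax}_i\!\left(\alpha^{tp}_{i}\right),
\qquad
\mathbf{x}^{t}_{s} = \sum_{i=1}^{T} \alpha^{tp\prime}_{i}\, \mathbf{t}^{p}_{i}
```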

Finally, XCs (dn+dc) & xts (5dt) are concatenated to form the aggregated Xs.

Part3. Modeling Positive Feedback (From Active Time)

  • Implicit positive feedback takes the form of the active time interval that a user spent on each article after clicking on it. If a user stays on an article only briefly, it is likely because the user was tricked by the headline and does not actually like the article. (P.S. If no explicit active time is available, it can be estimated from the interval between the user's two successive clicks.)

In this part, the authors take each degree of active time embedding tai from Section 2-1 and compute attention between it and each article's embedding vector xci.

The αi computed in the Part 1 base model:

αi = 1 x 1; W0: 1 x dn; W1: dn x (dn+dc); xci: (dn+dc) x 1; b0: dn x 1

is replaced with a new αi:

αi = 1 x 1; W0 = 1 x dn; W1 = dn x (dc+dn); xci = (dc+dn) x 1; W2 = dn x dt; tai = dt x 1; b0 = dn x 1

Then, as before, a weighted sum gives XCs and the aggregated Xs.

Redefined contextual vector representation. XCs = (dn+dc) x 1.
Aggregated Xs
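The corresponding equation images are missing; judging from the dimensions above, the new attention score presumably adds an active-time term to the base model's score (hedged):

```latex
\alpha_i = \mathbf{W}_0 \tanh\!\left(\mathbf{W}_1 \mathbf{x}^{c}_{i} + \mathbf{W}_2\, \mathbf{t}^{a}_{i} + \mathbf{b}_0\right),
\qquad
\mathbf{X}^{c}_{s} = \sum_{i=1}^{T} \operatorname{softmax}_i(\alpha_i)\, \mathbf{x}^{c}_{i},
\qquad
\mathbf{X}_s = [\mathbf{X}^{c}_{s} ; \mathbf{x}^{t}_{s}]
```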

Part4. Modeling Negative Feedback (From Impression List)

The most straightforward way to obtain negative feedback is the commonly used negative sampling, but items sampled at random may be completely unrelated to the user and thus pose too small a challenge for the model. An informative item should be able to confuse the model into discovering the more complex hidden meaning of user interactions.

Instead, the authors use the impression list Imp_u to generate negative data. Un-clicked news in Imp_u is treated as negative signals: compared with the other candidate items, these should incur a different loss so that the model learns to distinguish among candidate items, punishing the similarity between XCs and those strongly negative samples more strictly. (c.f. contrastive learning)

One thing to note is that the impression list is not always available, so the authors assume that an article is more likely to appear in Imp_u if it was published close to the articles clicked by user u. Concretely: sort the candidate news by publication time, keep the nearby articles within a window of size 300, and sample from them. The objective is to minimize the cosine similarity score between XCs and the vector XCj of each negative sample j (j ∈ Neu), where Neu is Su's negative sample set and Neu ⊆ Imp_u. A sketch of this sampling strategy follows.
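A minimal Python sketch of this publish-time-based negative sampling (my own illustration; the function and variable names are hypothetical, and centering the window on each clicked article is my assumption — the paper only states a window size of 300 over the publication-time ordering):

```python
import random

def sample_negatives(candidates, clicked_ids, n_neg=5, window=300):
    """Sample un-clicked negatives from articles published near the clicked ones.

    candidates: list of (article_id, publish_ts) pairs for all candidate news
    clicked_ids: set of article ids clicked in this session
    """
    ordered = [a for a, _ in sorted(candidates, key=lambda c: c[1])]  # sort by publish time
    pool = set()
    for pos, art_id in enumerate(ordered):
        if art_id in clicked_ids:                    # keep a window of nearby articles
            lo, hi = max(0, pos - window // 2), pos + window // 2
            pool.update(ordered[lo:hi])
    pool -= clicked_ids                              # negatives must be un-clicked
    return random.sample(sorted(pool), min(n_neg, len(pool)))
```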

For training, this term is appended to the original L1 to form L2.

1(·) returns 1 if the expression is true, 𝜆 is the weighting parameter of loss from negative articles. Jointly optimize these two losses with Adam optimizer.
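The loss image is missing; a reconstruction consistent with the caption above (hedged):

```latex
\mathcal{L}_2 = \mathcal{L}_1 + \lambda \sum_{S_u \in S} \sum_{j=1}^{N} \mathbb{1}\!\left(j \in Ne_u\right) \cos\!\left(\mathbf{X}^{c}_{s}, \mathbf{X}^{c}_{j}\right)
```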

Experiments

Datasets

1. Adressa

2. Globo

3. MIND

Dataset statistics (after preprocessing)

P.S. For dataset details, see the source code.

Metric

Besides Hit Rate@K & NDCG@K, two additional metrics are introduced: (1) ILD@K (2) unEXPu@K.

  1. Intra-List Diversity (ILD@𝑘) evaluates the topical/semantic diversity in 𝑅, and reflects the model’s ability to recommend different items to the same user.
d(a,b) is a distance measure between item a and b, and d(a,b) = 1 if item a,b belong to different topics (categories), 0 otherwise.
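The formula image is missing; the standard definition consistent with this caption would be (hedged):

```latex
\mathrm{ILD@}k = \frac{1}{k(k-1)} \sum_{a \in R} \sum_{b \in R,\, b \neq a} d(a, b)
```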

  2. The content-based unexpectedness metric (unEXP) can be used to measure this kind of unexpectedness. (We expect the system to recommend unseen items to surprise users.)
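No formula survived here either; a common content-based definition of unexpectedness, which I believe matches the paper's intent (hedged), compares the recommendation list R with the user's click history Hu:

```latex
\mathrm{unEXP}_u@k = \frac{1}{k\,|H_u|} \sum_{a \in R} \sum_{b \in H_u} d(a, b)
```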

Main Results

Main and ablation results (𝑘 = 20 by default in all our tables). All results are averaged over all folds. The best baseline result on each metric is marked with ∗ and overall best results are bolded. “Ours” is our whole model and (-) means ablating the corresponding module, where “neut”, “pos” and “neg” respectively refer to our neutral, positive and negative feedback modules. The last column replaces our negative sampling strategy with random sampling. ↓ indicates performance drop over the whole model.
The graphical comparison of all methods on 4 different metrics and 3 different datasets.“Ours” is our approach.
  1. Non-neural methods (CBCF and STAN) consider either the content information or the recency of the current session, and their results are somewhat comparable to the deep learning methods on all three datasets. However, they generate recommendation lists with low diversity / novelty, mainly because their simple algorithms cannot capture enough personalized information.
  2. Session-based approaches (STAMP and SGNNHN) yield better performance on HR and NDCG, but are not always good at ILD/unEXP, while SASRec and SRGNN recommend more diverse but less accurate items, showing the trade-off between diversity and accuracy.
  3. From the user’s aspect, though, when ILD/unEXP is over a threshold (like around 0.83), it’s hard for them to distinguish the difference, thus the ILD/unEXP score of our model is bearable.
  4. For “Ours”, when compared with STAMP, it performs better or close on both accuracy and diversity. This result shows that our model mitigates the dilemma between accuracy and diversity to a certain extent.

P.S. In the MIND dataset, the improvement is comparatively small, and the possible reasons are: on the one hand, MIND provides neither the active time interval nor the click time of each article (just the start time of a session), so positive feedback cannot be extracted from the data; on the other hand, from the results of CBCF, the authors assume the article transition information is too sparse, which makes it hard to recommend. Note that this dataset is not designed for session-based recommendation, hence some information may be inaccurate (e.g., one session may last for days, longer than 30 minutes).

Analysis

  1. Compared with the whole model, there is a huge drop after removing neutral information and this is the most consistent over all metrics, which reveals the importance of neutral information (temporal information).
  2. Adressa provides the most complete information: it not only releases the original text of articles (instead of the pre-extracted vectors in Globo) but also gives the accurate active time of the user on each article, whereas for Globo the active time can only be estimated from the interval between two successive clicks, which may not be accurate.
  3. After removing the positive implicit feedback module, in Adressa dataset, the HR and NDCG drop by 2.4% and 5.5% respectively, while in Globo dataset, they drop by 1.9% and 3.1%. The positive information performs similarly in both Adressa and Globo datasets, implying that our approximate estimation is reasonable. Further, the positive implicit feedback is more favorable on the Adressa dataset due to the more precise information.
  4. Negative information is less effective than positive information, especially by diversity/novelty metrics. One explanation is that the negative samples from the impression list are reconstructed based on their publishing time, so the information is not totally reliable.
  5. Negative sampling module lowers diversity, possibly because in the dataset the negative samples and the positive article usually belong to different categories, thus adding this module forces the model to recommend similar articles to the positive one.
  6. Negative feedback is better modeled in MIND due to its complete impression data. To verify the effect of the negative sampling strategy more accurately, we set the control group with random sampling, and we find that even though the random sampling would decrease the performance slightly, our negative feedback shows superior performance over it. The possible reason for the worse performance of using random sampling is that randomly sampled negative items have the possibility to be liked by this user, and this module imports some noise instead because this sample strategy does not consider what the user really likes.
Hyperparameter discussion

For panel (a) in the figure above, the authors wanted to validate the assumption behind the negative user feedback, namely that articles whose publishing time is close to the clicked articles are likely presented to the user, i.e. within their impressions. For each session Su, the authors sample negative items Neu using their strategy and compute the Jaccard similarity between Neu and the real Imp_u; the overall score is 0.0062 when |Neu| = 100, compared with 0.0044 for random sampling. This suggests their assumption is reasonable.

For panel (b), the negative loss is useful, but putting too much weight on it harms the learning of the user's positive feedback.

Different ways of utilizing temporal information in Adressa, where “p” stands for the publishing time and “s” stands for the start time
The visualization of time embedding tables for the day of a month, the day of a week, hour and minute, trained on Globo dataset.

The representation of the minutes is rather uniform and random, because news publishing and reading can happen any minute of an hour. But there are certainly more activities at certain hours during a day. There are also some irregular patterns for weekends as shown in (c).

P.S. Other experimental analyses are omitted from this article; see the original paper if interested.

Contribution

  • For the first time, the authors leverage the positive/negative/neutral implicit feedback in anonymous session-based news recommendation.
  • They design novel representations for temporal information and incorporate them, together with positive and negative feedback, into a deep attention network.
  • Comprehensive offline evaluations on three real-world datasets show the clear advantage of the proposed method in terms of overall performance on diversity, accuracy and serendipity in both normal and cold-start scenarios.

This article will be updated from time to time! Thanks for reading. If you like the content, please click the “clap” button. You can also press the follow button to keep track of new articles. Feel free to connect with me via LinkedIn or email.
