[Notes] (KDD2018) DIN: Deep Interest Network for Click-Through Rate Prediction

19 min read · Feb 10, 2022

Paper Link

https://arxiv.org/pdf/1706.06978.pdf

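Abstract

Recent DNN-based CTR models are essentially an Embedding + MLP combination. Sparse inputs are first mapped to low-dimensional embeddings, the embeddings are transformed into fixed-length vectors, and those vectors are fed into an MLP that learns the non-linear relations among features. The fixed-length vector is a bottleneck, however, because it struggles to capture a user's diverse interests from rich historical behaviors. The authors propose the Deep Interest Network (DIN), which uses a local activation unit to learn a representation of user interests from historical behaviors. They also propose two techniques that help train this very large network: mini-batch aware regularization and a data adaptive activation function.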

Model Architecture

Background

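For Alibaba, ads are goods. As the system illustration shows, the advertising system works from User History Behaviors and runs in two stages:

(1) Match Stage: generate candidate ads relevant to the user with methods such as collaborative filtering.

(2) Ranking Stage: predict the CTR of each candidate ad generated in the match stage and rank them.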

Interests of user with rich behaviors are diverse and could be locally activated given certain ads.

Feature Representation

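The raw input to the model looks roughly like this: [weekday=Friday, gender=Female, visited_cate_ids={Bag,Book}, ad_cate_id=Book]. It is usually converted to a high-dimensional sparse vector with one-hot or multi-hot encoding. Formally, the encoding vector of the i-th feature group is written t_i, a K_i-dimensional vector, where K_i is the number of unique ids in feature group i. t_i[j] is the j-th element of t_i, with t_i[j] ∈ {0,1}. Let k be the sum of t_i[j] for j from 1 to K_i: k = 1 means one-hot encoding and k > 1 means multi-hot encoding. One instance is therefore represented as follows, where M is the number of feature groups:

$$x = [t_1^T, t_2^T, \dots, t_M^T]^T, \qquad \sum_{i=1}^{M} K_i = K$$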

K = dimensionality of the entire feature space

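The example above is therefore encoded as:

[0,0,0,0,1,0,0] (weekday=Friday), [0,1] (gender=Female), [0,…,1,…,1,…,0] (visited_cate_ids={Bag,Book}), [0,…,1,…,0] (ad_cate_id=Book)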
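The group-wise encoding described here can be sketched in a few lines of Python. The vocabularies below are hypothetical and tiny (the real feature groups have far more ids), and the function names are made up for illustration.

```python
# Hypothetical vocabularies, one per feature group (assumption for illustration).
VOCAB = {
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Friday", "Sat", "Sun"],
    "gender": ["Male", "Female"],
    "visited_cate_ids": ["Bag", "Shoes", "Book"],
    "ad_cate_id": ["Bag", "Shoes", "Book"],
}


def encode_group(group, values):
    """Return the K_i-dimensional binary vector t_i for one feature group."""
    active = set(values)
    return [1 if token in active else 0 for token in VOCAB[group]]


def encode_instance(instance):
    """Concatenate all group vectors into x, following the order of VOCAB."""
    x = []
    for group in VOCAB:
        x.extend(encode_group(group, instance[group]))
    return x


instance = {
    "weekday": ["Friday"],
    "gender": ["Female"],
    "visited_cate_ids": ["Bag", "Book"],
    "ad_cate_id": ["Book"],
}

t_weekday = encode_group("weekday", instance["weekday"])
t_visited = encode_group("visited_cate_ids", instance["visited_cate_ids"])
x = encode_instance(instance)

print(t_weekday, sum(t_weekday))  # k = 1, so this group is one-hot
print(t_visited, sum(t_visited))  # k = 2, so this group is multi-hot
print(len(x))                     # K = 7 + 2 + 3 + 3 = 15
```

Each group keeps its own vocabulary, so the total dimensionality K is simply the sum of the per-group sizes.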

Statistics of feature sets used in the display advertising system in Alibaba. Features are composed of sparse binary vectors in the group-wise manner.

Note that in our setting, there are no combination features. We capture the interaction of features with deep neural network.

Base Model

The base model is what the paper calls Embedding&MLP. It has four parts, introduced one by one below.

Part1. Embedding Layer

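As in earlier approaches, the high-dimensional sparse vectors are transformed into low-dimensional dense vectors. Each feature group t_i has its own embedding dictionary W^i:

$$W^i = [w^i_1, \dots, w^i_j, \dots, w^i_{K_i}] \in \mathbb{R}^{D \times K_i}$$

Each w^i_j is the D-dimensional embedding vector of the j-th id.

- If t_i is one-hot with j-th element t_i[j] = 1, its embedded representation is a single embedding vector e_i = w^i_j.

- If t_i is multi-hot with t_i[j] = 1 for j ∈ {i_1, i_2, …, i_k}, its embedded representation is a list of embedding vectors:

$$\{e_{i_1}, e_{i_2}, \dots, e_{i_k}\} = \{w^i_{i_1}, w^i_{i_2}, \dots, w^i_{i_k}\}$$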
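The two lookup cases can be illustrated with a minimal sketch. The sizes D and K_i and the random initialization are assumptions for illustration only; a real embedding dictionary is learned during training.

```python
import random

D = 4    # embedding dimension (hypothetical)
K_i = 3  # number of unique ids in this feature group (hypothetical)

random.seed(0)
# W_i[j] plays the role of the D-dimensional embedding vector of id j.
W_i = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(K_i)]


def embed(t_i, table):
    """Return the embedding vectors selected by the non-zero entries of t_i."""
    return [table[j] for j, bit in enumerate(t_i) if bit == 1]


one_hot = [0, 1, 0]
multi_hot = [1, 0, 1]

e_single = embed(one_hot, W_i)  # one vector: the embedding of id 1
e_list = embed(multi_hot, W_i)  # a list of k = 2 vectors

print(len(e_single), len(e_list))
```

Multiplying the binary vector by the dictionary gives the same result; indexing is just the cheaper way to express it for sparse inputs.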

Part2. Pooling Layer and Concat Layer

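Different users have different numbers of behaviors. The number of non-zero values in a multi-hot feature vector t_i therefore varies from user to user, and so does the length of the corresponding list of embedding vectors. Because a fully connected network can only handle fixed-length inputs, a pooling layer transforms the list of embedding vectors into a fixed-length vector. Sum pooling and average pooling are the most common choices.

Once every feature group has its fixed-length embedding vector, the vectors are concatenated to form the overall representation.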
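A small sketch of how pooling turns behavior lists of different lengths into vectors of the same size, and how the result is concatenated with another group's embedding. The vectors and names are made up for illustration.

```python
def sum_pooling(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def average_pooling(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [total / n for total in sum_pooling(vectors)]


def concat(*vectors):
    """Concatenate per-group vectors into one overall representation."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


# Two hypothetical users with 3 and 1 behaviors, embedding dimension 2.
user_a_behaviors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
user_b_behaviors = [[1.0, 1.0]]
gender_embedding = [0.5, -0.5]

pooled_a = sum_pooling(user_a_behaviors)      # [9.0, 12.0]
pooled_b = sum_pooling(user_b_behaviors)      # [1.0, 1.0]
mean_a = average_pooling(user_a_behaviors)    # [3.0, 4.0]
rep_a = concat(gender_embedding, pooled_a)    # length 4 for every user

print(pooled_a, pooled_b, mean_a, rep_a)
```

Both users end up with a 2-dimensional behavior vector, so the MLP input size no longer depends on how many behaviors a user has.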

Part3. Multi-Layer Perceptron (MLP)

The concatenated dense representation vector from the previous layer is then fed into the MLP, whose fully connected layers learn the combinations and interactions among the features.

Part4. Loss

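The objective function of the model is the negative log-likelihood:

$$L = -\frac{1}{N} \sum_{(x,y) \in S} \big( y \log p(x) + (1 - y) \log(1 - p(x)) \big)$$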

S is the training set of size N, with x as the input of the network and y ∈ {0,1} as the label, p(x) is the output of the network after the softmax layer, representing the predicted probability of sample x being clicked.
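As a quick illustration, the negative log-likelihood over (prediction, label) pairs can be computed as below. The clipping constant is an added numerical guard, not part of the objective, and the sample values are hypothetical.

```python
import math


def negative_log_likelihood(samples):
    """Average negative log-likelihood.

    samples: list of (p, y) pairs, where p is the predicted click
    probability and y in {0, 1} is the label.
    """
    eps = 1e-12  # numerical guard against log(0); not part of the formula
    total = 0.0
    for p, y in samples:
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(samples)


uninformed = negative_log_likelihood([(0.5, 1), (0.5, 0)])
confident = negative_log_likelihood([(0.9, 1), (0.1, 0)])
print(uninformed, confident)
```

Predicting 0.5 for everything gives a loss of log 2, and confident correct predictions push the loss toward 0.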

Deep Interest Network (DIN)

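The base model obtains a fixed-length representation by pooling all the embedding vectors of the user behavior feature group, so the vector stays the same whatever the candidate ad is. Such a fixed-length representation cannot express the diversity of user interests well. A naive fix is to enlarge the embedding dimension, but that greatly increases the number of parameters and invites overfitting. Is there a more elegant solution? The local activation characteristic of user interests gave the authors the key idea. Take the paper's example of a young mother: the displayed ad is effectively soft-searched against her historical behaviors, which finds that she recently browsed similar tote bags and leather handbags. In other words, the behaviors related to the displayed ad contribute most to whether she clicks it. DIN simulates this process by focusing on the representation of locally activated interests. For each candidate ad, DIN weighs the relevance of the historical behaviors and computes a different vector to represent the user's interest.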

User interest representation varies over different candidate ads.

DIN is identical to the base model except for the local activation unit. The activation unit is applied to the user behavior features: given the current candidate ad A, it assigns an attention weight to each historical behavior (e_1, e_2, e_3, …, e_H), then performs weighted sum pooling with those weights to produce v_U:

$$v_U(A) = f(v_A, e_1, e_2, \dots, e_H) = \sum_{j=1}^{H} a(e_j, v_A)\, e_j = \sum_{j=1}^{H} w_j e_j$$

{e_1, e_2, …, e_H} is the list of embedding vectors of behaviors of user U with length H. v_A is the embedding vector of ad A. In this way, v_U(A) varies over different ads. a(·) is a feed-forward network whose output is the activation weight. Apart from the two input embedding vectors, a(·) adds their outer product to feed into the subsequent network, which is explicit knowledge that helps relevance modeling.

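The attention mechanism is the familiar Key, Query, Value (K, Q, V) scheme: here the Query is the candidate ad, and both the Keys and the Values are the user history. The attention used here differs slightly from the conventional form, though: the constraint that all attention weights sum to 1 is dropped. That is, the output of a(·) is not normalized with softmax, in order to preserve the intensity of user interests.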
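Here is a minimal sketch of the local activation unit followed by weighted sum pooling. It is not the authors' implementation: the network size, the random weights, and the use of ReLU in place of PReLU/Dice are all assumptions for illustration. A plain dot product is also used as a stand-in scorer so the behavior is easy to follow.

```python
import random


def outer_flat(u, v):
    """Flattened outer product of two vectors."""
    return [ui * vj for ui in u for vj in v]


def dense(x, weights, bias):
    """One fully connected layer without activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_activation_unit(dim, hidden, seed=0):
    """Build a(e_j, v_a): a tiny feed-forward net with random weights."""
    rng = random.Random(seed)
    in_dim = 2 * dim + dim * dim  # e_j, outer product, v_a
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
          for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]]
    b2 = [0.0]

    def a(e_j, v_a):
        x = e_j + outer_flat(e_j, v_a) + v_a
        h = [max(0.0, z) for z in dense(x, w1, b1)]  # ReLU as a stand-in
        return dense(h, w2, b2)[0]  # raw scalar weight, no softmax

    return a


def user_interest(behaviors, v_a, a):
    """Weighted sum pooling: v_U(A) = sum_j a(e_j, v_A) * e_j."""
    weights = [a(e_j, v_a) for e_j in behaviors]
    v_u = [sum(w * e[d] for w, e in zip(weights, behaviors))
           for d in range(len(v_a))]
    return v_u, weights


def dot(e, v):
    """Stand-in scorer used only to make the example easy to read."""
    return sum(x * y for x, y in zip(e, v))


behaviors = [[1.0, 0.0], [0.0, 1.0]]
v_u_dot, w_dot = user_interest(behaviors, [2.0, 3.0], dot)
print(w_dot, sum(w_dot))  # weights are not normalized to sum to 1
print(v_u_dot)

unit = make_activation_unit(dim=2, hidden=4)
v_u_net, w_net = user_interest(behaviors, [2.0, 3.0], unit)
print(w_net, v_u_net)
```

Because the weights depend on the candidate ad, the same behavior list yields a different v_U for each ad, and skipping softmax lets a strongly matching history produce a larger vector than a weakly matching one.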

DIN Visualization

Illustration of adaptive activation in DIN. Behaviors with high relevance to candidate ad get high activation weight.

Visualization of embeddings of goods in DIN. Shape of points represents category of goods. Color of points corresponds to CTR prediction value.

Training Techniques: Part1 — Mini-batch Aware Regularization (MBA)

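According to the authors' experiments, without a regularization term the model's performance drops rapidly after the first epoch of training. Yet applying traditional regularization (e.g., L1 or L2) directly to a network with sparse inputs and hundreds of millions of parameters is impractical. Why? Take L2 regularization: in SGD without regularization, only the parameters of the non-zero sparse features appearing in each mini-batch need to be updated. Once L2 regularization is added, the L2-norm must be computed over all parameters for every mini-batch, which makes the computation prohibitively expensive.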

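To address this, the authors propose the mini-batch aware regularizer, which computes the L2-norm only over the parameters of the sparse features that appear in each mini-batch. The computational bottleneck is in fact the embedding dictionary, which contributes most of the parameters of a CTR network and causes the heavy computation. Let W ∈ R^{D×K} denote the parameters of the whole embedding dictionary, where D is the dimension of the embedding vector and K is the dimension of the feature space. Expanding the L2 regularization on W over samples gives:

$$L_2(W) = \|W\|_2^2 = \sum_{j=1}^{K} \|w_j\|_2^2 = \sum_{(x,y) \in S} \sum_{j=1}^{K} \frac{I(x_j \neq 0)}{n_j} \|w_j\|_2^2$$

Here w_j ∈ R^D is the j-th embedding vector, I(x_j ≠ 0) denotes whether instance x has feature id j, and n_j denotes the number of occurrences of feature id j in all samples. The expression can be transformed into a mini-batch aware form:

$$L_2(W) = \sum_{j=1}^{K} \sum_{m=1}^{B} \sum_{(x,y) \in B_m} \frac{I(x_j \neq 0)}{n_j} \|w_j\|_2^2$$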

B denotes the number of mini-batches and B_m denotes the m-th mini-batch. Next, define α_mj, which indicates whether at least one instance in mini-batch B_m has feature id j:

$$\alpha_{mj} = \max_{(x,y) \in B_m} I(x_j \neq 0)$$

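With these quantities, the L2(W) above can be approximated by:

$$L_2(W) \approx \sum_{j=1}^{K} \sum_{m=1}^{B} \frac{\alpha_{mj}}{n_j} \|w_j\|_2^2$$

This is the authors' approximated mini-batch aware version of L2 regularization. For the m-th mini-batch, the gradient with respect to the embedding weights of feature j and the resulting update are, with learning rate η and regularization coefficient λ:

$$w_j \leftarrow w_j - \eta \left[ \frac{1}{|B_m|} \sum_{(x,y) \in B_m} \frac{\partial L(p(x), y)}{\partial w_j} + \lambda \frac{\alpha_{mj}}{n_j} w_j \right]$$

Only the parameters of features that appear in the m-th mini-batch take part in the regularization computation.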
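To make the idea concrete, here is a minimal sketch of one SGD step with mini-batch aware L2 regularization. It assumes the data gradients (already averaged over the mini-batch) are supplied by the caller. The dictionary-based embedding table, the function name, and all numbers are hypothetical.

```python
def mba_update(W, batch, data_grads, n, lr, lam):
    """One SGD step on embedding table W with mini-batch aware L2.

    W:          dict feature_id -> embedding vector (list of floats)
    batch:      list of instances, each a set of active feature ids
    data_grads: dict feature_id -> mini-batch-averaged loss gradient
    n:          dict feature_id -> occurrence count over all samples
    lr, lam:    learning rate and regularization coefficient
    Returns the set of feature ids that were touched.
    """
    # Ids that appear in at least one instance of this mini-batch,
    # i.e. the ids whose indicator alpha_mj equals 1.
    active = set().union(*batch)
    for j in active:
        grad = data_grads.get(j, [0.0] * len(W[j]))
        # Regularization term is lam * w_j / n_j, applied only to active ids.
        W[j] = [w - lr * (g + lam * w / n[j]) for w, g in zip(W[j], grad)]
    return active


table = {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [3.0, 3.0]}
counts = {0: 10, 1: 2, 2: 5}
mini_batch = [{0}, {0, 1}]
grads = {0: [0.1, 0.1], 1: [0.0, 0.0]}

touched = mba_update(table, mini_batch, grads, counts, lr=0.1, lam=1.0)
print(touched)
print(table)
```

Feature 2 does not occur in the mini-batch, so its embedding is left alone. Feature 1 is rare (n_j = 2), so it is shrunk more strongly than the frequent feature 0 (n_j = 10).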

Training Techniques: Part2 — Data Adaptive Activation Function

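PReLU is a commonly used activation function:

$$f(s) = \begin{cases} s & \text{if } s > 0 \\ \alpha s & \text{if } s \le 0 \end{cases} = p(s) \cdot s + (1 - p(s)) \cdot \alpha s$$

Here s is one dimension of the input of the activation function f(·). p(s) = I(s > 0) is an indicator function that controls whether f(s) switches to f(s) = s or f(s) = αs, and α is a learnable parameter. We refer to p(s) as the control function.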

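PReLU takes 0 as its hard rectified point, which may not suit cases where the inputs of each layer follow different distributions. To address this, the authors propose a novel data adaptive activation function named Dice:

$$f(s) = p(s) \cdot s + (1 - p(s)) \cdot \alpha s, \qquad p(s) = \frac{1}{1 + e^{-\frac{s - E[s]}{\sqrt{Var[s] + \epsilon}}}}$$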

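During training, E[s] and Var[s] are the mean and variance of the input in each mini-batch. During testing, E[s] and Var[s] are computed as moving averages of E[s] and Var[s] over the data. ε is a small constant, set to 10^-8 in the authors' experiments. Dice can be viewed as a generalization of PReLU. Its key idea is to adapt the rectified point to the distribution of the input data, setting it to the mean of the input. Dice also switches smoothly between the two channels. When E[s] = 0 and Var[s] = 0, Dice degenerates into PReLU.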
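A training-time sketch of Dice next to PReLU may make the difference easier to see. It covers only the mini-batch statistics case; the moving averages used at test time are left out. The value of α is fixed here for illustration, whereas in the model it is learned.

```python
import math


def prelu(s, alpha):
    """PReLU: hard switch at 0."""
    return s if s > 0 else alpha * s


def dice(batch, alpha, eps=1e-8):
    """Dice over one mini-batch of scalar inputs (training-time statistics).

    The control function p(s) is a sigmoid of the standardized input,
    so the switch point sits at the batch mean instead of at 0.
    """
    mean = sum(batch) / len(batch)
    var = sum((s - mean) ** 2 for s in batch) / len(batch)
    out = []
    for s in batch:
        p = 1.0 / (1.0 + math.exp(-(s - mean) / math.sqrt(var + eps)))
        out.append(p * s + (1.0 - p) * alpha * s)
    return out


alpha = 0.25  # hypothetical; a learned parameter in the real model

centered = dice([-1.0, 0.0, 1.0], alpha)
shifted = dice([9.0, 10.0, 11.0], alpha)

print(centered)
print(shifted)
print([prelu(s, alpha) for s in [9.0, 10.0, 11.0]])
```

For the shifted batch, PReLU passes every value through unchanged because all inputs are positive, while Dice still distinguishes inputs below and above the batch mean of 10.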

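P.S. These two techniques, mini-batch aware regularization and the data adaptive activation function, are fairly abstract and not very intuitive. I hope an expert here writes a post that explains them in detail!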

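Experiment

P.S. Apologies in advance for this part. Since these notes focus on understanding the model architecture, the experiment section only includes the figure and table captions and omits the commentary. Please refer to the original paper for the detailed explanation.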

Performances of BaseModel with different regularizations on Alibaba Dataset. Training with fine-grained Goods_ids features without regularization encounters serious overfitting after the first epoch. All the regularizations show improvement, among which our proposed mini-batch aware regularization performs best. Besides, well trained model with Goods_ids features gets higher AUC than without them. It comes from the richer information that fine-grained features contained.

Model Comparison on Amazon Dataset and MovieLens Dataset. All the lines calculate RelaImpr by comparing with BaseModel on each dataset respectively.

Best AUCs of BaseModel with different regularizations on Alibaba Dataset. All the other lines calculate RelaImpr by comparing with first line.

Model Comparison on Alibaba Dataset with full feature sets. All the lines calculate RelaImpr by comparing with BaseModel. DIN significantly outperforms all the other competitors. Besides, training DIN with our proposed mini-batch aware regularizer and Dice activation function brings further improvements.
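The captions above report RelaImpr. As defined in the DIN paper, it measures the relative improvement in AUC after subtracting 0.5, the score of a random guesser. The sketch below uses made-up AUC values, not results from the paper.

```python
def rela_impr(auc_model, auc_base):
    """Relative improvement over the base model, in percent.

    0.5 is the AUC of a random guesser, so only the part of the AUC
    above 0.5 counts as real performance.
    """
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


# Hypothetical AUCs for illustration only.
improvement = rela_impr(auc_model=0.62, auc_base=0.60)
print(round(improvement, 2))  # about 20 percent
```

This shows why a seemingly small absolute AUC gain can be a large relative gain: going from 0.60 to 0.62 is a 20% increase in the margin over random guessing.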

Reference

1. 帶你學深度學習推薦系統 by 王喆 (矽谷資深演算法大師：帶你學深度學習推薦系統, www.books.com.tw)

2. DIN TensorFlow code: GitHub - zhougr1993/DeepInterestNetwork (github.com)

3. Paper walkthrough on Zhihu: 2018阿里ctr预估算法din论文阅读笔记 (zhuanlan.zhihu.com)

4. Paper notes on CSDN: 【论文笔记】DIN: Deep Interest Network for Click-Through Rate Prediction_xzhws的博客-CSDN博客 (blog.csdn.net)

Post Script

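What attention brings to recommender systems:
Mathematically, attention simply replaces the earlier sum pooling or average pooling with a weighted sum or weighted average. This mechanism benefits recommender systems considerably, because the attention score reflects characteristics of how human attention naturally works. Modeling it brings the recommender closer to the user's actual thinking process and further improves recommendation quality.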

This article may be updated at any time. Thanks for reading! If you like the content, please click the “clap” button. You can also follow to keep track of new articles. Feel free to connect with me via LinkedIn or email.

Written by Haren Lin
