[Notes] Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features

Feb 4, 2022

Paper Link

https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf

Abstract

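Hand-crafted feature combinations have long been the secret behind many successful models. For web-scale applications, however, the variety and volume of features make these hand-crafted combinations expensive to create, maintain, and deploy. This paper proposes Deep Crossing, a deep neural network that automatically combines features and implicitly discovers the important crossing features. The input to Deep Crossing is a set of individual features, which can be dense or sparse. The network consists of an Embedding Layer, a Stacking Layer, and a cascade of Residual Units.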

Introduction

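Deep Crossing was built for sponsored search in Microsoft's search engine, Bing. After a user enters a query, the sponsored search engine retrieves the relevant results and also returns ads related to the query (this is how most search engines make money). The goal is for the ads the model surfaces to have as high a Click Through Rate (CTR) as possible, so the model is trained to predict clicks. The paper defines the following terms, several of which serve as the model's features: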

Query: A text string a user types into the search box.
Keyword: A text string related to a product, specified by an advertiser to match a user query.
Title: The title of a sponsored advertisement (referred to as “an ad” hereafter), specified by an advertiser to capture a user’s attention.
Landing page: A product’s web site a user reaches when the corresponding ad is clicked by a user.
Match type: An option given to the advertiser on how closely the keyword should be matched by a user query, usually one of four kinds: exact, phrase, broad and contextual.
Campaign: A set of ads that share the same settings such as budget and location targeting, often used to organize products into categories.
Impression: An instance of an ad being displayed to a user. An impression is usually logged with other information available at run-time.
Click: An indication of whether an impression was clicked by a user. A click is usually logged with other information available at run-time.
Click through rate: Total number of clicks over total number of impressions.
Click Prediction: A critical model of the platform that predicts the likelihood a user clicks on a given ad for a given query.

So how are these features represented when they are fed into the Input Layer?

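Each individual feature X_i is represented as a vector. For text features (e.g. Query, Keyword, Title), one option is to convert the string into a 49,292-dimensional tri-letter gram vector, as in the Deep Semantic Similarity Model (DSSM). Categorical inputs such as MatchType are represented as one-hot vectors; for example, exact match is [1, 0, 0, 0] and phrase match is [0, 1, 0, 0]. A sponsored search system usually has millions of campaigns, so converting campaign IDs directly into one-hot vectors would significantly increase the model size. To address this, the authors use a pair of companion features, as exemplified in the paper's table: CampaignID is a one-hot vector that covers only the top 10,000 campaigns by number of clicks, and the 10,000th slot (index starting from 0) is reserved for all the remaining campaigns. Those remaining campaigns are covered by CampaignIDCount, a numerical feature that stores per-campaign statistics such as click through rate.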

I read the details of how the input data is processed but did not fully understand them; they would probably be clearer alongside the code. Here is the original text: There are usually millions of campaigns in a sponsored search system. Simply converting campaign ids into a one-hot vector would significantly increase the size of the model. One solution is to use a pair of companion features as exemplified in the table, where CampaignID is a one-hot representation consisting only of the top 10,000 campaigns with the highest number of clicks. The 10,000th slot (index starts from 0) is saved for all the remaining campaigns. Other campaigns are covered by CampaignIDCount, which is a numerical feature that stores per campaign statistics such as click through rate.
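To make the encoding a bit more concrete, here is a minimal Python sketch of the three ideas described above: tri-letter grams for text, a one-hot MatchType, and the CampaignID / CampaignIDCount companion pair. This is only my reading of the paper's description, not the authors' code. All function names, the toy click log, and the tiny `top_k=2` (the paper uses 10,000) are assumptions for illustration, and the `#` word-boundary marker follows the DSSM word-hashing convention.

```python
from collections import Counter

MATCH_TYPES = ["exact", "phrase", "broad", "contextual"]


def tri_letter_grams(word):
    """Split a word into letter trigrams, using '#' as a boundary marker."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_match_type(match_type):
    """One-hot encode MatchType, e.g. "exact" -> [1, 0, 0, 0]."""
    return one_hot(MATCH_TYPES.index(match_type), len(MATCH_TYPES))


def build_campaign_vocab(click_log, top_k):
    """Map the top_k most-clicked campaign ids to slots 0..top_k-1."""
    counts = Counter(click_log)
    return {cid: slot for slot, (cid, _) in enumerate(counts.most_common(top_k))}


def encode_campaign(campaign_id, vocab, top_k, campaign_ctr):
    """Return the companion pair (CampaignID one-hot, CampaignIDCount)."""
    # Slot `top_k` is the shared bucket for all remaining campaigns.
    slot = vocab.get(campaign_id, top_k)
    campaign_id_vec = one_hot(slot, top_k + 1)
    # Numerical companion feature: a per-campaign statistic (here, CTR).
    campaign_id_count = campaign_ctr.get(campaign_id, 0.0)
    return campaign_id_vec, campaign_id_count


# Toy data (hypothetical): each entry is the campaign id of one click.
clicks = ["c1", "c1", "c1", "c2", "c2", "c3"]
ctr = {"c1": 0.30, "c2": 0.20, "c3": 0.05}
vocab = build_campaign_vocab(clicks, top_k=2)

print(tri_letter_grams("web"))               # ['#we', 'web', 'eb#']
print(encode_match_type("phrase"))           # [0, 1, 0, 0]
print(encode_campaign("c1", vocab, 2, ctr))  # ([1, 0, 0], 0.3)
print(encode_campaign("c3", vocab, 2, ctr))  # ([0, 0, 1], 0.05)
```

Note how "c3" falls outside the top-k list, so it lands in the shared last slot and is distinguished only by its numerical statistic. That seems to be the point of the companion pair: the one-hot stays small while rare campaigns still carry some signal.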

The Deep Crossing model has four main types of layers: Embedding + Stacking + Multiple Residual Units + Scoring. The paper uses log loss as the objective function, but it can be swapped for softmax or other functions.

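$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)$$

where i = index over training examples; N = number of training samples; y_i = per-sample label; p_i = output prediction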
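Using the symbols just defined (labels y_i, predictions p_i, N samples), the standard binary log loss can be computed as in the sketch below. The function name and the `eps` clipping (to avoid taking log of 0) are my own additions and are not part of the paper.

```python
import math


def log_loss(labels, predictions, eps=1e-15):
    """Mean binary log loss: -1/N * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clip p away from 0 and 1 so that log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# An uninformative model (p = 0.5 everywhere) scores ln 2, about 0.693.
print(log_loss([1, 0], [0.5, 0.5]))
# Confident and correct predictions give a lower loss.
print(log_loss([1, 0], [0.9, 0.1]))
```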

[1] Embedding Layer: converts sparse categorical features into dense vectors, which we call embeddings. An embedding vector usually has far fewer dimensions than the original feature vector. Here a simple single-layer neural network generates the embedding, with ReLU as the Embedding Layer's activation function.
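As a rough sketch, a single-layer embedding with ReLU computes max(0, W x + b) for each input feature. The pure-Python helpers below (`relu`, `dense`, `embed`) and the toy weights are hypothetical; a real model would learn W and b and use a tensor library.

```python
def relu(vec):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in vec]


def dense(x, weights, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def embed(x_in, weights, bias):
    """Single-layer embedding: ReLU(W x + b)."""
    return relu(dense(x_in, weights, bias))


# Toy example: a 4-dim one-hot input mapped to a 2-dim embedding.
W = [[0.5, -1.0, 0.25, 0.0],
     [-0.5, 2.0, 0.0, 1.0]]
b = [0.0, 0.1]
x = [0, 1, 0, 0]

print(embed(x, W, b))  # [0.0, 2.1]
```

For a one-hot input, W x simply selects one column of W, so the layer behaves like a lookup table followed by a bias and a ReLU.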

[2] Stacking Layer: concatenates the Embedding Layer outputs of the different categorical features with the numerical features, producing a single feature vector that contains all the features.
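Stacking is just concatenation. A tiny sketch, with made-up embedding values and a hypothetical `stack` helper:

```python
def stack(feature_vectors):
    """Concatenate per-feature vectors into one flat vector."""
    stacked = []
    for vec in feature_vectors:
        stacked.extend(vec)
    return stacked


# Hypothetical embedding outputs plus one numerical feature.
query_emb = [0.2, 0.0, 1.3]
keyword_emb = [0.0, 0.7, 0.1]
campaign_id_count = [0.05]  # a numerical feature as a length-1 vector

stacked = stack([query_emb, keyword_emb, campaign_id_count])
print(stacked)       # [0.2, 0.0, 1.3, 0.0, 0.7, 0.1, 0.05]
print(len(stacked))  # 7
```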

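[3] Multiple Residual Units (MRU): this part of the network is a stack of residual layers. The Residual Units let the dimensions of the feature vector cross and combine thoroughly, so the model can capture more nonlinear features and combination information.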

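Deep Crossing uses a slightly modified Residual Unit that does not use convolutional kernels. The Residual Unit has two distinctive properties: (1) the original input features pass through two fully connected layers with ReLU as the activation function to produce an output vector; (2) the input also travels along a shortcut path and is added element-wise to that output vector to form the final output. With this structure, the two ReLU layers of the Residual Unit are actually fitting the residual between the output and the input (X^O - X^I).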
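The two properties above can be sketched in a few lines of Python. This follows the description in this note (two ReLU layers, then an element-wise add with the input); the exact placement of the second ReLU relative to the sum is an assumption here, and the helper names and toy weights are hypothetical.

```python
def relu(vec):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in vec]


def dense(x, weights, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_unit(x_in, w0, b0, w1, b1):
    """Return F(x_in) + x_in, where F is two fully connected ReLU layers."""
    hidden = relu(dense(x_in, w0, b0))
    # Assumption: the second ReLU is applied before the shortcut sum.
    residual = relu(dense(hidden, w1, b1))
    # Shortcut path: add the untouched input back element-wise.
    return [r + x for r, x in zip(residual, x_in)]


x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all-zero weights F(x) = 0, so the unit is an identity mapping.
print(residual_unit(x, zeros_w, zeros_b, zeros_w, zeros_b))    # [1.0, -2.0]
# With identity weights F(x) = relu(relu(x)) = [1.0, 0.0].
print(residual_unit(x, identity, zeros_b, identity, zeros_b))  # [2.0, -2.0]
```

The first call shows why the layers only have to learn the residual: if nothing useful is learned, the unit still passes its input through unchanged.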

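Residual Networks were introduced mainly to address two problems:
(1) Is a deeper NN always better? For a traditional perceptron-based NN (MLP), accuracy often degrades as the network gets deeper, which is commonly attributed to overfitting and to the network becoming harder to optimize. In a Residual Network the input can skip over the two ReLU layers through the shortcut, which mitigates this.
(2) When an NN is deep enough, it often suffers from severe gradient vanishing: during backpropagation, the closer a layer is to the input, the smaller its gradient magnitude and the slower its parameters converge. The Residual Unit uses ReLU in place of sigmoid, and the shortcut effectively passes the gradient back to the earlier layer unchanged, which also makes a Residual Network converge faster.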
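The shortcut argument in (2) can be written out in one line. Writing the unit's input as X^I, its output as X^O, the two ReLU layers as a function F, and the training loss as L (F and L are my notation, and the gradient is treated as a row vector), the chain rule gives:

```latex
X^{O} = \mathcal{F}(X^{I}) + X^{I}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial X^{I}}
  = \frac{\partial L}{\partial X^{O}}
    \left( \frac{\partial \mathcal{F}}{\partial X^{I}} + I \right)
  = \frac{\partial L}{\partial X^{O}} \frac{\partial \mathcal{F}}{\partial X^{I}}
  + \frac{\partial L}{\partial X^{O}}
```

The last term is the gradient arriving through the shortcut: it reaches the earlier layer unchanged even when the term that goes through F is very small.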

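[4] Scoring Layer: put simply, this layer fits our target. For a binary classification problem such as CTR prediction it is usually logistic regression (a sigmoid output); for multi-class problems, softmax is used instead.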
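For the binary CTR case, the scoring layer amounts to a weighted sum followed by a sigmoid. A minimal sketch, with hypothetical names and toy weights; the two-branch sigmoid is just a common way to avoid overflow for large negative inputs:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(x, weights, bias):
    """Logistic-regression output: predicted click probability."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Toy input: the output vector of the last Residual Unit.
features = [2.0, -2.0]
print(score(features, [0.0, 0.0], 0.0))  # 0.5 (zero weights, no information)
print(score(features, [1.0, 0.5], 0.0))  # above 0.5, since z = 1.0 > 0
```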

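In summary, the main contributions of Deep Crossing are:
[Key1] Turning sparse vectors into dense ones
[Key2] Crossing and combining features automatically
[Key3] Setting the problem the output layer has to solve as the optimization objective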

P.S. The Experiment section is omitted from these notes!

Conclusion

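Deep Crossing involves no manual feature engineering: the raw features go through the embedding layer into the deep network, and all of the feature crossing is left to the model. Compared with the earlier FM and FFM models, which can only cross features at second order, Deep Crossing can perform "deep crossing" that is not limited to first or second order, which is also where the name Deep Crossing comes from.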

Reference

1. 帶你學深度學習推薦系統 by 王喆 (book; Section 3.3 covers the Deep Crossing model), www.books.com.tw

2. [论文笔记]Deep Crossing模型原理 (Zhihu article by another author, based on Section 3.3 of the book above), zhuanlan.zhihu.com
