
Automated Embedding Size Search in Deep Recommender Systems

Haochen Liu∗
Michigan State University
[email protected]

Xiangyu Zhao∗
Michigan State University
[email protected]

Chong Wang
Bytedance
[email protected]

Xiaobing Liu
Bytedance
[email protected]

Jiliang Tang
Michigan State University
[email protected]

ABSTRACT
Deep recommender systems have achieved promising performance on real-world recommendation tasks. They typically represent users and items in a low-dimensional embedding space and then feed the embeddings into the following deep network structures for prediction. Traditional deep recommender models often adopt uniform and fixed embedding sizes for all the users and items. However, such a design is not optimal in terms of both recommendation performance and space complexity. In this paper, we propose to dynamically search the embedding sizes for different users and items and introduce a novel embedding size adjustment policy network (ESAPN). ESAPN serves as an automated reinforcement learning agent to adaptively search appropriate embedding sizes for users and items. Different from existing works, our model performs hard selection on different embedding sizes, which leads to a more accurate selection and decreases the storage space. We evaluate our model under the streaming setting on two real-world benchmark datasets. The results show that our proposed framework outperforms representative baselines. Moreover, our framework is demonstrated to be robust to the cold-start problem and to reduce memory consumption by around 40%-90%. The implementation of the model is released¹.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Reinforcement learning; Neural networks.

KEYWORDS
Recommender System; AutoML; Embedding

ACM Reference Format:
Haochen Liu, Xiangyu Zhao, Chong Wang, Xiaobing Liu, and Jiliang Tang. 2020. Automated Embedding Size Search in Deep Recommender Systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3397271.3401436

∗Both authors contributed equally to this research.
¹https://github.com/zgahhblhc/ESAPN

1 INTRODUCTION
Recommender systems have been widely deployed in various commercial scenarios, from e-commerce websites, music and video platforms to social media [11]. They model the user-item relationships and seek to predict users' preferences for items [18]. Recently, deep learning based recommender models [14, 24] have attracted considerable interest from the research community thanks to their excellent ability to learn the feature representations of users and items as well as to model the non-linear relationships between users and items [26]. Deep recommender systems consist of two key components: a representation layer and an inference layer. The former learns to map discrete user and item identifiers into real-valued embedding representations, and the latter takes the embedding representations as input to calculate the prediction results for specific recommendation tasks. Many endeavors have been devoted to designing various network architectures for the inference layer to improve the performance of recommender systems, while the representation layer remains not well studied. However, representation learning plays an important role in the success of recommender systems [4, 15, 17]. Learned embedding representations can effectively represent categorical features like user and item identifiers in a low-dimensional embedding space. They preserve the valuable features of the users and items. With them, recommender systems can make more accurate inferences and thus boost their performance. Meanwhile, as a part of the whole recommendation framework, the embedding representations are key model parameters. They are learned jointly with the other parts to better adapt to specific recommendation tasks and even show a greater impact on model performance than other parts [7]. Hence, designing a suitable representation layer is important for deep recommender systems.

Traditional deep recommender systems adopt uniform and fixed embedding sizes for all users and items in the representation layer. However, this design may not be optimal for real-world recommendation problems. We propose to dynamically search the embedding sizes for different users and items, for the following reasons. First, in a recommendation problem, different users and items have highly varied frequencies. For users/items with high frequency, high-dimensional embeddings can achieve better performance, since they provide sufficient parameters and capacity to encode the features of the users/items [28].


[Figure 1: An illustration of the overview of the proposed framework (policy network, embedding table, recommender prediction, and reward function).]

Moreover, low-dimensional embeddings are more suitable for users/items with low frequency. With little data, low-dimensional embeddings can be well trained, but high-dimensional embeddings may lead to overfitting due to over-parameterization [9]. Thus, uniform embedding sizes for users and items limit the performance of the model. Second, in real-world recommendations, the frequency of a user/item changes dynamically. The embedding size of a user/item should be adjusted along with the change of its frequency so that its features can be encoded more effectively. Fixed-size embeddings cannot cope with users and items whose frequencies change dynamically. Third, given that there are typically a huge number of users and items in a recommendation task, using embeddings with uniform and fixed dimensions for all users/items requires much memory [9]. By selecting appropriate embedding sizes for different users and items, we can save storage space while maintaining the performance of the model.

To dynamically search the embedding sizes for different users and items, we face the following challenges. First, a recommender system often has a massive number of users and items, so it is difficult, if not impossible, to search the embedding size for each of them manually. Second, although AutoML methods such as differentiable architecture search (DARTS) [12] have been proposed to automatically search neural network architectures, they cannot be applied directly, since dynamically searching the embedding sizes is a non-differentiable operation, which cannot be learned by gradient backpropagation. Moreover, although some previous works [28] tried to make this operation differentiable by using a soft selection over embeddings of different sizes, they increase the storage space

and fail to select one embedding size while eliminating the influence of the other embedding sizes. To address the above challenges, in this work we propose a novel Embedding Size Adjustment Policy Network (ESAPN), which serves as an automated reinforcement learning (RL) agent to dynamically search the embedding sizes for users and items in deep recommender systems under the streaming scenario. In the policy network, we apply hard selection on different embedding sizes so that we can accurately select one embedding size at each time. Accordingly, we also propose an optimization method to update the whole framework, i.e., the policy network and the recommender system. We conduct extensive experiments on two public recommendation datasets, and the results show that our proposed framework outperforms several competitive baselines. Besides, our framework is shown to be more robust to the cold-start problem and to reduce storage memory significantly.

The remainder of this paper is organized as follows. First, we detail the proposed framework, including the basic recommendation model and the embedding size adjustment policy network, in Section 2. Then, we present our experimental setup and results with discussions in Section 3. Next, in Section 4, we review related work. Finally, Section 5 concludes the work with possible future research directions.

2 THE PROPOSED FRAMEWORK
In this section, we detail our proposed framework, which dynamically adjusts the embedding sizes of users and items in the streaming recommendation. We first give an overview of the whole framework, then introduce the deep recommendation model, and next present the policy network for adjusting embedding sizes. Afterward, we detail the method to optimize our proposed framework.


[Figure 2: An illustration of the deep recommendation model: the user and item embeddings are linearly transformed, concatenated, and fed into a multilayer perceptron with a Sigmoid/Softmax output.]

2.1 Overview
In this work, we investigate the task of streaming recommendation. In the streaming setting, user-item transactions are collected continuously from a steady stream [2]. Given each mini-batch of user-item transactions, the recommendation model first makes predictions of the users' preferences for items. The prediction results are recorded for evaluation. Then, based on the ground-truth labels, the recommendation model is optimized by minimizing the prediction errors. Thus, a streaming recommendation model conducts inference and training alternately on each mini-batch of data that appears in the stream, which is different from the traditional recommendation setting where data is explicitly split into training and test sets. In the streaming setting, the frequencies of users and items change dynamically over time, so, as discussed in Section 1, it is desirable to automatically adjust the embedding sizes of the users and items accordingly.

Our framework consists of two core components: (i) the deep recommendation model and (ii) the embedding size adjustment policy network. The deep recommendation model takes user-item pairs as input to predict the user's preference for the item. The policy network serves as an RL agent to adaptively adjust the embedding sizes of the users and items throughout the streaming recommendation process so that the recommendation model can be better trained and its performance boosted. An overview of the proposed framework is illustrated in Figure 1. As shown in the figure, given the $k$-th user-item pair in the stream, the policy network first selects the appropriate embedding sizes for both of them. Next, we look up the embedding table to obtain the embeddings of the user and the item with the proper sizes, respectively. Afterward, the embeddings are fed into the deep recommendation model to calculate the prediction results. Combined with the ground-truth label, the recommendation model is updated. Furthermore, the predictions pass through a reward function to compute a reward, which serves as a signal to update the policy network. Next, we detail the deep recommendation model and the policy network.

2.2 Deep Recommendation Model
The deep recommendation model is illustrated in Figure 2. Given a pair of user $u$ and item $i$, the model predicts the preference of the user for the item, which can be a binary 0-1 label indicating whether the user will purchase the item or not, or a categorical rating measuring the degree of preference. We investigate both cases in this work. We refer to the former as the binary classification task and the latter as the multiclass classification task.

Suppose we have $n$ candidate embedding sizes for both users and items, $D = \{d_1, d_2, \ldots, d_n\}$, where $d_1 < d_2 < \cdots < d_n$.² Given the user-item pair $(u, i)$, the policy network adaptively chooses the $j^{(u)}$-th and the $j^{(i)}$-th embedding sizes for the user and the item, respectively. From the embedding table, we get the embedding of the user $\mathbf{e}^{(u)}$ with dimension $d_{j^{(u)}}$ and the embedding of the item $\mathbf{e}^{(i)}$ with dimension $d_{j^{(i)}}$. Since the input dimension of the following inference layer is fixed, we need to unify the sizes of the input embeddings. Thus, we first define a series of linear transformations that map embeddings into the adjacent larger dimension (e.g., $d_1 \to d_2$, $d_2 \to d_3$, etc.):

$$\mathbf{e}_2 = W_{1 \to 2}\,\mathbf{e}_1 + b_{1 \to 2}$$
$$\mathbf{e}_3 = W_{2 \to 3}\,\mathbf{e}_2 + b_{2 \to 3} \qquad (1)$$
$$\cdots$$
$$\mathbf{e}_n = W_{n-1 \to n}\,\mathbf{e}_{n-1} + b_{n-1 \to n}$$

where $\mathbf{e}_k$ is any embedding with dimension $d_k$ ($k = 1, 2, \ldots, n$), and $W_{k-1 \to k}$ and $b_{k-1 \to k}$ are learnable weight and bias parameters. Note that we maintain two different sets of weight and bias parameters for users and items; for simplicity, we do not explicitly distinguish them. In the implementation, the inference layer requires user/item embeddings of size $d_n$; thus, given user embedding $\mathbf{e}^{(u)} \in \mathbb{R}^{d_{j^{(u)}}}$ and item embedding $\mathbf{e}^{(i)} \in \mathbb{R}^{d_{j^{(i)}}}$, we map them into the largest dimension $d_n$ through consecutive linear transformations to obtain the transformed embeddings $\mathbf{e}^{(u)} \in \mathbb{R}^{d_n}$ and $\mathbf{e}^{(i)} \in \mathbb{R}^{d_n}$:

$$\mathbf{e}^{(u/i)} \xrightarrow[+\,b_{j^{(u/i)} \to j^{(u/i)}+1}]{\times\, W_{j^{(u/i)} \to j^{(u/i)}+1}} \cdots \xrightarrow[+\,b_{n-1 \to n}]{\times\, W_{n-1 \to n}} \mathbf{e}^{(u/i)} \in \mathbb{R}^{d_n}$$

²Users and items can have two different sets of embedding sizes. For simplicity, we assume that they share the same set in this work.


[Figure 3: An illustration of the embedding size adjustment policy network: the frequency and the current size of a user/item are fed into a multilayer perceptron whose Softmax output chooses between the actions "enlarge" and "unchange".]


Following [28], we perform batch normalization with the Tanh activation function on the transformed embeddings. Next, their concatenation $[\mathbf{e}^{(u)} : \mathbf{e}^{(i)}]$ is fed into the inference layer, which comprises a multilayer perceptron (MLP) with $m$ hidden layers:

$$h_1 = \tanh(W_1 [\mathbf{e}^{(u)} : \mathbf{e}^{(i)}]) + b_1$$
$$h_2 = \tanh(W_2 h_1) + b_2$$
$$\cdots$$
$$h_m = \tanh(W_m h_{m-1}) + b_m$$

The hidden state of the last layer $h_m$ is then used to calculate the prediction results through an activation function, which depends on the specific task: the Sigmoid function for the binary classification task and the Softmax function for the multiclass classification task.

2.3 The Policy Network
The policy network serves as an RL agent that dynamically adjusts the embedding sizes of users and items. It maintains the current embedding sizes of all the users and items. Given the state of a user or an item, it takes an action to adjust the embedding size following a policy. It interacts with the environment by feeding the adjusted embeddings into the recommendation model and receives the corresponding reward based on the prediction. Afterward, the policy is updated to maximize the reward. Next, we introduce the environment, state, policy, action, and reward in detail.

2.3.1 Environment. The environment is the deep recommendation model. Given a pair of user and item along with their adjusted embedding sizes, it returns the prediction result.

2.3.2 State. The state $s = (f, e)$ is the combination of the frequency $f$ and the current embedding size $e$ of a given user or item. Based on the frequency and the current embedding size of the user/item, the policy network decides how to adjust the embedding size.

2.3.3 Policy and Action. We have two policy networks of the same structure for users and items, respectively. Below, we take the user policy network, illustrated in Figure 3, as an example. The user policy network takes the state $s$ (the frequency and the current embedding size of a user) as input to predict an action $a$ indicating how we should adjust its embedding size. Given the frequency and the current embedding size of a user, we first map the frequency (an integer value) into an embedding representation and convert the current embedding size into a one-hot vector with dimension $n$, where $n$ is the number of candidate embedding sizes. Afterward, these two vectors are concatenated and fed into a multilayer perceptron (MLP) with $m$ hidden layers. Finally, the last hidden state $h_m$ passes through a Softmax layer to predict the probabilities of two actions: (i) enlarge and (ii) unchange. "Enlarge" means to enlarge the embedding size of the given user to the next larger candidate embedding size, such as from $d_k$ to $d_{k+1}$, while "unchange" means to keep the current embedding size. The reason why we only increase or keep the embedding dimensions is that the frequencies of users are monotonically increasing in the data stream. We only allow the embedding size to change to the adjacent candidate size rather than even larger ones at each time, since the frequencies of users change gradually and the change of embedding sizes should be consistent with this behavior. If we choose to enlarge the embedding size of a given user, we initialize its embedding with the next larger size by applying the learned linear transformation in Eq. 1. For example, when we enlarge the embedding size of a user from $d_k$ to $d_{k+1}$, the new embedding $\mathbf{e}_{k+1}$ is initialized from the old embedding $\mathbf{e}_k$ through $\mathbf{e}_{k+1} = W_{k \to k+1}\mathbf{e}_k + b_{k \to k+1}$. In this way, the information in the learned old embedding is inherited and the new embedding is not initialized from scratch.
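Below is a minimal sketch of this policy network, again under illustrative names (SizeAdjustPolicy); the frequency embedding dimension and the frequency cap max_freq are our assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ENLARGE, UNCHANGE = 0, 1  # the two actions of Section 2.3.3

class SizeAdjustPolicy(nn.Module):
    """State (frequency, current size index) -> probabilities over actions."""

    def __init__(self, n_sizes=6, freq_dim=16, hidden=512, max_freq=100_000):
        super().__init__()
        self.n_sizes = n_sizes
        # Map the integer frequency to a learned vector (capped at max_freq).
        self.freq_emb = nn.Embedding(max_freq, freq_dim)
        self.mlp = nn.Sequential(
            nn.Linear(freq_dim + n_sizes, hidden), nn.Tanh(),
            nn.Linear(hidden, 2))

    def forward(self, freq, size_idx):
        f = self.freq_emb(freq.clamp(max=self.freq_emb.num_embeddings - 1))
        s = F.one_hot(size_idx, self.n_sizes).float()  # one-hot current size
        return F.softmax(self.mlp(torch.cat([f, s], dim=-1)), dim=-1)
```

An action can then be drawn with torch.distributions.Categorical(probs).sample(); when the sampled action is ENLARGE, the size index is incremented and the embedding is re-initialized through the learned transformation in Eq. 1.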

Advantages of Hard Selection. The work [28] changes the embedding sizes of users/items by adjusting the weights of various candidate embedding sizes and calculates the embedding to use as a weighted sum of all the candidate embeddings at each time, which is known as soft selection. In contrast, our method selects the embedding of one size at each time, which is known as hard selection. Hard selection has two benefits. First, hard selection focuses on the embedding of one size at each time and can eliminate the interference of redundant information from the embeddings of other sizes. Adopting embeddings of one size at each time also


makes the training more efficient and leads to better performance, which will be verified in Section 3.6. Second, hard selection requires less memory. Soft selection needs to maintain the embeddings of all the candidate sizes, while hard selection only needs to maintain the embedding of the current size: once we decide to enlarge the size of an embedding, the old embedding is abandoned. It will be demonstrated in Section 3.7 that our method is significantly more efficient in terms of memory.

2.3.4 Reward. We choose suitable embedding sizes for users and items to boost the performance of the recommendation model, so the policy network should be optimized by maximizing a reward that reflects the prediction ability of the recommendation model. We define the rewards based on the prediction errors made by the recommendation model. Given a transaction between user $u$ and item $i$ and their selected embedding sizes, the recommendation model makes a prediction and returns a loss $L$ based on the ground-truth label. The loss functions are defined later in Section 2.4.1 for the binary and multiclass classification tasks, respectively. We use two first-in-first-out (FIFO) queues to maintain the latest $T$ prediction losses, $L^{(u)} = (L^{(u)}_1, \ldots, L^{(u)}_T)$ for the user $u$ and $L^{(i)} = (L^{(i)}_1, \ldots, L^{(i)}_T)$ for the item $i$, where $L^{(u/i)}_t$ denotes the $t$-th most recent prediction loss on the user $u$ / the item $i$. Given the current prediction loss $L$ of the user $u$ and the item $i$, the rewards for the user policy network and the item policy network are defined as the difference between the average of the last $T$ prediction losses and the current loss, respectively:

$$R^{(u)} = \frac{1}{T}\sum_{t=1}^{T} L^{(u)}_t - L \qquad (2)$$
$$R^{(i)} = \frac{1}{T}\sum_{t=1}^{T} L^{(i)}_t - L$$

The reward can be viewed as the reduction of the current prediction loss compared with previous losses. Since the prediction losses of one user can vary widely when it interacts with different items (and vice versa), we use the average of the last $T$ losses as the baseline instead of the single most recent loss.
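The reward bookkeeping reduces to a pair of FIFO queues per user and item. A minimal sketch follows; the helper names are ours, and the zero reward for an empty history is our assumption, since the paper does not specify the warm-up behavior.

```python
from collections import defaultdict, deque

T = 5  # number of recent losses kept per user/item (see Section 3.2)
user_losses = defaultdict(lambda: deque(maxlen=T))  # FIFO queues
item_losses = defaultdict(lambda: deque(maxlen=T))

def reward(history, current_loss):
    """Eq. 2: average of the last T losses minus the current loss."""
    if not history:            # assumption: no history yet -> zero reward
        return 0.0
    return sum(history) / len(history) - current_loss

# After the recommender predicts on (u, i) and incurs loss L:
#   r_u = reward(user_losses[u], L)
#   r_i = reward(item_losses[i], L)
#   user_losses[u].append(L); item_losses[i].append(L)
```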

2.4 An Optimization Method
In this subsection, we introduce how to optimize the deep recommendation model and the policy network, respectively. We then detail how to train the whole framework under the streaming setting.

2.4.1 Deep Recommendation Model. Suppose that the deep recommendation model is parameterized by $\Theta$. It is optimized in a supervised manner by minimizing the difference between prediction results and ground truths. In the binary classification task, the model is trained by minimizing the mean squared error (MSE) loss. Given a mini-batch of user-item transactions $\{u_k, i_k\}_{k=1}^{N}$ along with the corresponding ground-truth labels $\{y_k\}_{k=1}^{N}$, where $y_k$ is 0 or 1, the recommendation model predicts the probabilities $\{\hat{y}_k\}_{k=1}^{N}$ of the labels being 1. The MSE loss is defined as:

$$MSE(\Theta) = \frac{1}{N}\sum_{k=1}^{N}(y_k - \hat{y}_k)^2 \qquad (3)$$

where $N$ is the size of a mini-batch.

In the multiclass classification task, the recommendation model is optimized with the cross-entropy (CE) loss. Suppose $\{c_k\}_{k=1}^{N}$ are the ground-truth labels of the mini-batch $\{u_k, i_k\}_{k=1}^{N}$. The CE loss is defined as:

$$CE(\Theta) = -\frac{1}{N}\sum_{k=1}^{N}\sum_{m=0}^{M-1} y_{km}\log p_{km} \qquad (4)$$

where $N$ is the size of a mini-batch, $M$ is the number of classes, $y_{km}$ is 1 when $(u_k, i_k)$ belongs to class $m$ (i.e., $c_k = m$) and 0 otherwise, and $p_{km}$ denotes the probability predicted by the recommendation model that $(u_k, i_k)$ belongs to class $m$.

2.4.2 Policy Network. Suppose the user policy network and the item policy network are parameterized by $\Phi^{(u)}$ and $\Phi^{(i)}$, respectively. As discussed in Section 2.3, we formulate the optimization of the policy network as a reinforcement learning problem. We take the user case as an example and omit the superscript $(u)$ for simplicity. With the reward function defined in Eq. 2, the objective function that the policy network aims to maximize can be formulated as:

$$J(\Phi) = \mathbb{E}_{a \sim \Phi(a|s)}\, R(a|s) \qquad (5)$$

where $s$ is the state and $a$ is the action. In practice, it is not easy to obtain the exact value of $J(\Phi)$ in Eq. 5. Various methods have been proposed to estimate its value and gradient. We adopt Monte-Carlo sampling to estimate $\nabla_\Phi J(\Phi)$ and apply the REINFORCE algorithm [23] to optimize the objective in Eq. 5. Specifically,

$$\nabla_\Phi J(\Phi) = \sum_a R(a|s)\,\nabla\Phi(a|s) = \sum_a R(a|s)\,\Phi(a|s)\,\nabla\log\Phi(a|s) = \mathbb{E}_{a \sim \Phi(a|s)}\left[R(a|s)\,\nabla\log\Phi(a|s)\right] \approx \frac{1}{N}\sum_{i=1}^{N} R(a|s)\,\nabla\log\Phi(a|s) \qquad (6)$$

where $N$ is the number of samples. In our implementation, we set the sample number $N = 1$ to improve computational efficiency. With the obtained gradient $\nabla_\Phi J(\Phi)$, the parameters of the policy network can be updated as follows:

$$\Phi \leftarrow \Phi + \alpha_P \nabla_\Phi J(\Phi) \qquad (7)$$

where $\alpha_P$ is the learning rate.
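In an autograd framework, this update is usually implemented as gradient descent on the negated objective. A minimal PyTorch sketch, with function names of our own choosing; the paper's $N = 1$ sampling corresponds to a single call per state:

```python
import torch

def sample_action(policy, freq, size_idx):
    """Draw a ~ Phi(.|s) and keep its log-probability for the update."""
    probs = policy(freq, size_idx)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action, dist.log_prob(action)

def reinforce_update(log_prob, reward_value, optimizer):
    """Eqs. 6-7 with a single Monte-Carlo sample: gradient ascent on
    J(Phi) is realized as gradient descent on -R * log Phi(a|s)."""
    loss = -(reward_value * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```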

2.4.3 The Overall Optimization Algorithm. In this subsection, we detail our AutoML-based algorithm for optimizing the whole framework. The algorithm is presented in Algorithm 1. In the streaming recommendation, user-item transaction data are collected from a steady stream $S$. We iterate over batches of data from the stream to optimize the framework. Inspired by ENAS [16], each iteration has two stages: (i) the policy network optimization stage (lines 2-6), where we update the policy network $\Phi$ using sampled validation data, and (ii) the model optimization stage (lines 7-10), where the recommendation model $\Theta$ is updated using the training data from the stream.


Algorithm 1: An optimization method for the whole framework.

Input: user-item transaction data stream $S = \{D_1, \ldots, D_N\}$, which consists of $N$ batches of data $D_i = \{(u_k, i_k, y_k)\}_{k=1}^{n}$; initial recommendation model $\Theta$ and policy network $\Phi$; hyper-parameters $\alpha_P$ and $\alpha_R$.
Output: a well-trained recommendation model $\Theta^*$ and a well-trained policy network $\Phi^*$.

1  repeat
2    Sample a validation batch $V = \{(u_k, i_k, y_k)\}_{k=1}^{n}$ from the history transaction data
3    Sample action $a_V \sim \Phi(\cdot \mid s_V)$
4    Temporarily adjust the embedding sizes of $(u_k, i_k)_{k=1}^{n}$ according to $a_V$
5    Input $(u_k, i_k)_{k=1}^{n}$ into the recommendation model $\Theta$ and calculate the rewards by Eq. 2
6    Update the policy network $\Phi$ by Eq. 7
7    Collect the current batch of transaction data $D = \{(u_k, i_k, y_k)\}_{k=1}^{n}$ from the stream $S$
8    Select action $a_D$ from $\Phi(\cdot \mid s_D)$
9    Permanently adjust the embedding sizes of $(u_k, i_k)_{k=1}^{n}$ according to $a_D$
10   Input $(u_k, i_k)_{k=1}^{n}$ into the recommendation model $\Theta$ and update $\Theta$ by Eq. 3 or Eq. 4
11 until the recommendation model $\Theta$ and the policy network $\Phi$ converge


Specifically, in the policy network optimization stage, we first sample a batch of validation data $V$ from the previous stream data, i.e., the historical data (line 2). Based on the states of the users and items $(u_k, i_k)_{k=1}^{n}$ in the validation data $V$, we sample actions from the policy network $\Phi$ (line 3) and adjust their embedding sizes accordingly (line 4). Note that we change the embedding sizes only temporarily. We then input the user-item pairs into the recommendation model $\Theta$ to compute the rewards (line 5). Next, the original embedding sizes are restored. Based on the rewards, the policy network $\Phi$ is updated by Eq. 7 (line 6). In the model optimization stage, we use the current batch of stream data $D$ to update the recommendation model $\Theta$ (line 7). Given the state $s_D$, we select the action $a_D$ with the highest probability predicted by the policy network $\Phi$ (line 8) and permanently adjust the embedding sizes accordingly (line 9). Next, we optimize the recommendation model $\Theta$ by minimizing the MSE loss in Eq. 3 or the CE loss in Eq. 4 (line 10). The above procedure is repeated until the recommendation model $\Theta$ and the policy network $\Phi$ converge.
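The two-stage loop can be summarized in a short skeleton. The sketch below only fixes the control flow of Algorithm 1; the stage bodies are passed in as callables because their details (reward computation, temporary vs. permanent size adjustment) are given above, and the parameter names are ours.

```python
from typing import Any, Callable, Iterable

def train_stream(stream: Iterable[Any],
                 sample_validation: Callable[[], Any],
                 policy_stage: Callable[[Any], None],
                 model_stage: Callable[[Any], None],
                 remember: Callable[[Any], None]) -> None:
    """Skeleton of Algorithm 1 over a stream of mini-batches."""
    for batch in stream:
        # Stage 1 (lines 2-6): sample validation data from history, sample
        # actions, adjust sizes *temporarily*, compute rewards (Eq. 2),
        # update the policy network (Eq. 7), then restore the old sizes.
        policy_stage(sample_validation())
        # Stage 2 (lines 7-10): take the argmax actions, adjust sizes
        # *permanently*, and update the recommender on the current batch
        # by minimizing the MSE (Eq. 3) or CE (Eq. 4) loss.
        model_stage(batch)
        remember(batch)  # the batch becomes history for future validation
```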

3 EXPERIMENT
To validate the performance of our proposed framework, we conduct extensive experiments on two large-scale real-world recommendation datasets. Through the experiments, we seek to answer the following three questions:

• How does our proposed framework perform compared with traditional deep recommendation models with fixed embedding sizes? In other words, does dynamically adjusting embedding sizes boost the performance on recommendation tasks?
• How does our framework perform compared with other representative AutoML-based recommendation models?
• Can our method alleviate the cold-start problem and reduce memory consumption?

In this section, we first introduce the experimental settings, including the datasets, the baselines, and the implementation details. Then we present the overall performance comparison, the performance comparison with respect to frequency, and the memory consumption comparison, with discussions.

3.1 Datasets
Our proposed method is evaluated on the following two popular public recommendation benchmark datasets:

• MovieLens 20M Dataset³ (ml-20m): A movie rating dataset collected from the personalized movie recommendation website MovieLens⁴. The dataset contains 20 million 5-star ratings and 465,000 free-text tags applied to 27,278 movies, provided by 138,493 users. The users are randomly selected, and every selected user is guaranteed to have rated at least 20 movies.
• MovieLens Latest Datasets⁵ (ml-latest): A newly released movie rating dataset collected from the same website. The dataset contains 27 million 5-star ratings and 1.1 million tags applied to 58,098 movies from 283,228 users. In this dataset, the users are also randomly selected, and every selected user has rated at least 1 movie.

3.2 Implementation Details
In this subsection, we present the implementation details of our recommendation model and the embedding size adjustment policy network, such as the choice of model structures and the setting of the hyper-parameters.

First, we set six candidate embedding sizes $D = \{2, 4, 8, 16, 64, 128\}$ for both the users and items. Second, for the recommendation model, we adopt one hidden layer with a size of 512. It is trained with the Adam [10] optimizer with an initial learning rate of 0.001. Third, for the policy network, we also adopt one hidden layer with a size of 512. We use the last $T = 5$ prediction losses to compute the reward. The policy network is trained with the Adam optimizer with an initial learning rate of 0.0001. The whole framework is trained on mini-batches of size 500.
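For reference, these hyper-parameters can be gathered in one place; the dictionary layout below is ours, but every value comes from this subsection.

```python
# Hyper-parameters of Section 3.2 (values from the paper; layout ours).
CONFIG = {
    "candidate_sizes": (2, 4, 8, 16, 64, 128),  # D, shared by users and items
    "rec_hidden": 512,       # one hidden layer in the recommendation model
    "rec_lr": 1e-3,          # Adam, initial learning rate
    "policy_hidden": 512,    # one hidden layer in the policy network
    "policy_lr": 1e-4,       # Adam, initial learning rate
    "reward_window_T": 5,    # last T losses kept in the FIFO queues
    "batch_size": 500,
}
```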

In the streaming recommendation setting, there is no explicit split into training and test sets. Given a mini-batch from the data stream, the recommendation model is first tested on it and then trained on it. The model is evaluated and trained alternately from the beginning to the end. Hence, for both datasets, we choose the last 5 million transactions as the "test set" and report the performance on it. Note that when evaluating the

³https://grouplens.org/datasets/movielens/20m/
⁴https://movielens.org/
⁵https://grouplens.org/datasets/movielens/latest/


model on the "test set", the model keeps being updated. All baseline models follow the same setting as our model.

3.3 Baselines
We compare our proposed framework with the following three baseline methods:

• FIXED: The original recommendation model with embeddings of uniform and fixed sizes. We adopt an embedding size of 128 for all the users and items.
• DARTS [12]: A traditional DARTS method that adjusts embedding sizes by learning weights over the six candidate embedding sizes {2, 4, 8, 16, 64, 128} for each user or item. The embedding of a user or an item is then represented as the weighted sum of the embeddings of the six candidate sizes.
• AutoEmb [28]: A DARTS-based method that trains a controller to dynamically assign weights to the six candidate embedding sizes {2, 4, 8, 16, 64, 128} based on the frequency of a given user or item. The embeddings are also represented by soft selection.

3.4 Tasks and Evaluation Metrics
We evaluate the baselines and our model on a binary classification task and a multiclass classification task.

• Binary Classification Task: We transform the 5-star ratings into binary labels, where ratings of 4 and 5 are viewed as positive (i.e., 1) and the rest as negative (i.e., 0), and ask the recommendation model to predict the correct label. We use a predictive probability of 0.5 as the threshold to assign the label. We measure the performance of the models by the classification accuracy and the mean squared error loss.
• Multiclass Classification Task: We view the 5-star ratings as 5 classes and ask the model to predict the correct rating. In this task, we evaluate the models by the classification accuracy and the cross-entropy loss.

3.5 Overall Performance Comparison
In Table 1, we show the overall performance of the baselines as well as our model on the two datasets. Based on Table 1, we make the following observations:

• First, our model outperforms the FIXED baseline by a significant margin in terms of all metrics on all datasets and tasks. This verifies that by adopting dynamically changing embedding sizes for users and items, we can train a better recommendation model with boosted performance.
• Second, the baselines DARTS and AutoEmb achieve better results than FIXED. Since DARTS and AutoEmb perform a soft selection over a set of embeddings of various sizes with dynamic weights, they realize the dynamic adjustment of embedding sizes to some extent and improve performance. Moreover, AutoEmb outperforms DARTS. Different from DARTS, AutoEmb introduces a controller to adjust the weights of different embedding sizes based on the frequencies of the users and items. The improvement demonstrates that the selection of a reasonable embedding size is related to the frequency, which can serve as evidence to help us adjust the embedding sizes.
• Third, our model outperforms all the baseline models. The embeddings used in DARTS and AutoEmb are soft combinations of embeddings of different sizes. Although these models can learn to assign a high weight to the appropriate size, the other embedding sizes still receive non-zero weights and contribute to the final embeddings, which results in inaccurate predictions. On the contrary, our model performs hard selection, which accurately chooses one embedding size at each time and avoids the influence of the other embedding sizes. This design makes the model easier to train and enables it to make more accurate inferences as well.

Table 1: Overall performance comparison.

ml-20m
Models   | Binary Accuracy (%) | Binary MSE Loss | Multiclass Accuracy (%) | Multiclass CE Loss
FIXED    | 72.13 | 0.1845 | 49.45 | 1.1517
DARTS    | 72.18 | 0.1836 | 49.85 | 1.1423
AutoEmb  | 72.27 | 0.1828 | 49.94 | 1.1399
ESAPN    | 72.98 | 0.1785 | 51.10 | 1.1126

ml-latest
Models   | Binary Accuracy (%) | Binary MSE Loss | Multiclass Accuracy (%) | Multiclass CE Loss
FIXED    | 72.13 | 0.1845 | 50.01 | 1.1414
DARTS    | 72.22 | 0.1834 | 50.46 | 1.1304
AutoEmb  | 72.36 | 0.1823 | 50.47 | 1.1311
ESAPN    | 72.88 | 0.1790 | 51.11 | 1.1147


3.6 Performance Comparison with Frequency
In this section, we compare the performance of our model and the baselines on users/items with varied frequencies to further understand the superiority of our model. In Figure 4, we show the average performance of the models on users/items with frequencies from 1 to 100: the accuracy and MSE loss for the binary classification task and the accuracy and CE loss for the multiclass classification task, on the ml-20m dataset. From the figures, we can see that: (i) All the models perform better on users/items with higher frequencies than on those with lower frequencies. This is easy to explain, since there are more training examples for frequent users/items and we know more information about them. (ii) Our model achieves better performance than the baselines. (iii) The lower the frequencies of the users/items, the larger the improvements of our model over the baselines. The baseline methods with soft selection of embedding sizes suffer from the cold-start problem. They adopt a combination of embeddings of various sizes as the embedding, which is hard to train when the user or the item appears only a few times. Moreover, they can hardly assign all the weight to small embedding sizes for infrequent users and items the way hard selection does, which inevitably introduces noise from the other embedding sizes. On the contrary, our model learns to


[Figure 4: Performance comparison on users and items with varied frequencies (ml-20m; panels: binary accuracy, binary MSE loss, multiclass accuracy, multiclass CE loss; top row users, bottom row movies).]

adaptively assign small embedding sizes to infrequent users and items, so the cold-start problem is alleviated.

To understand how our model chooses the embedding sizes with respect to frequency, we draw Figure 5. In this figure, we show the average embedding sizes chosen for users/items with frequencies ranging from 1 to 50, on the ml-20m dataset for the binary and multiclass classification tasks. We can see that the model adaptively selects small embedding sizes for infrequent users/items and large ones for frequent users/items. This supports the above observation: our model is more robust to the frequencies of users and items.

3.7 Memory Consumption Comparison
In traditional recommendation models, the embeddings of users and items are uniform and fixed; thus the storage space needed to store the embedding table grows linearly with the vocabulary size [9]. However, real-world recommender systems typically involve a huge number of users and items, which requires massive storage resources. By adaptively assigning embedding sizes to different users and items, our model can reduce the unnecessary usage of high-dimensional embeddings so that storage space can be saved significantly. Note that the embedding layer often takes up most of the parameters of a deep recommender system [7]; thus reducing the embedding sizes can evidently save memory. We show the numbers of different embedding sizes the model has assigned to different users and items at the end of the data stream in the middle section of Table 2. We illustrate the comparison of memory consumption between our model and the baseline FIXED in the right

[Figure 5: The average embedding sizes selected for users and items with various frequencies (ml-20m; panels: binary and multiclass tasks, for users and movies).]

section of Table 2. The baseline FIXED uses the maximum candidate embedding size of 128 for all the users and items. The column "Total Dim" indicates the total number of embedding dimensions assigned to all the users/items by our model. As a comparison, the column "FIXED" shows the total number of embedding dimensions needed in the FIXED model, that is, #(user) × 128 or #(item) × 128.

Industry (SIRIP) Papers II SIGIR ’20, July 25–30, 2020, Virtual Event, China

2314

Page 9: Automated Embedding Size Search in Deep Recommender Systemszhaoxi35/paper/sigir2020.pdf · Recommender systems have been widely deployed in various com-mercial scenarios from e-commerce

Table 2: Memory consumption comparison.

                  |      2 |      4 |      8 |     16 |     64 |    128 | Total Dim |      FIXED |  Ratio
user (ml-20m)     |  1,041 | 10,854 | 18,146 | 30,836 | 22,830 | 54,786 | 9,157,770 | 17,727,104 | 51.66%
movie (ml-20m)    | 17,032 |    216 |    380 |    623 |  2,237 |  6,790 | 1,060,224 |  3,491,584 | 30.37%
user (ml-latest)  | 62,808 | 83,190 | 31,450 | 22,778 | 31,613 | 51,389 | 9,675,448 | 36,253,184 | 26.69%
movie (ml-latest) | 45,430 |  2,089 |  2,006 |  1,678 |  1,545 |  5,350 |   925,792 |  7,436,544 | 12.45%

The last column "Ratio" shows the ratio of the total embedding dimensions of our model to those of the FIXED model. All the results are collected in the binary classification task. We can see that our model assigns low-dimensional embeddings to a large number of infrequent users and items and decreases memory usage by around 40%-90% compared with the traditional recommendation model with uniform and fixed embedding sizes. Moreover, the AutoML-based models, DARTS and AutoEmb, assign a total embedding size of 222 (the sum of all candidate sizes) to every user and item; they occupy 1.73 times as much memory as the FIXED model. Compared with these AutoML models, our model's advantage in reducing memory consumption is even more significant.

4 RELATED WORK
In this section, we go over the related work. We first review the latest studies on deep recommender systems and then discuss related work on AutoML for neural architecture search.

4.1 Deep Recommender Systems
Deep recommender systems enhance recommendation performance and overcome the limitations of traditional approaches [27]. We classify the existing works according to the types of deep learning techniques. Wide&Deep [5] is an MLP-based model that can handle both regression and classification tasks; the Wide part is a single perceptron layer that can be viewed as a linear model, and the Deep part is an MLP that tries to capture more general and abstract representations. AutoRec [19] is an AutoEncoder-based model with two variants, item-based and user-based AutoRec, which aim to learn low-dimensional feature embeddings of users and items. Wang et al. [22] studied the influence of visual features extracted by CNNs for POI (Point-of-Interest) recommendations; their model is based on PMF and learns the interactions between visual content and latent user/location factors. GRU4Rec [6] is an RNN-based method that models the sequential influence of item transitions for session-based recommendation, where a session-parallel mini-batch algorithm and a sampling method for the output are proposed to increase training efficiency. Chen et al. [3] proposed an attention-based collaborative filtering model with two levels of attention mechanisms, where the item-level one selects the most informative items to represent users, and the component-level one selects the most informative features from auxiliary sources for each user. Zhao et al. [29-32] proposed a series of deep reinforcement learning based recommender systems, which typically model the user-system interaction as a Markov decision process and aim to maximize the long-term reward from the environment (users). IRGAN [21] is the first GAN-based model for information retrieval, which validated the capability of GANs in web search, item recommendation, and question answering tasks.

4.2 AutoML Methods for NAS
Another category related to this work is AutoML for neural architecture search (NAS). The first paper in this area is NAS [33], which leverages reinforcement learning with an RNN to train a large number of candidate components to convergence. Because of the high training cost, many follow-up works aim to develop NAS with lower resource requirements. One category is to build a large model that connects smaller model components and to select a subset of model components for each candidate architecture, so that the optimal architecture is trained in a single run. For instance, ENAS [16] samples a subset of models via a controller, while SMASH [1] generates weights for sampled networks via a hyper-network. DARTS [12] and SNAS [25] view connections as weights and optimize them via back-propagation. Luo et al. [13] project neural architectures into embeddings, so that the optimal embedding can be trained and decoded back to the final architecture. Another category is to reduce the search space. One way is to search for convolution cells that can be stacked repeatedly into a deep network. NASNet [34] uses transfer learning to train cells on smaller datasets and then applies them to larger datasets. MNAS [20] proposed a search space involving hierarchical convolution cell blocks that can be searched independently to form different structures. Joglekar et al. [8] first utilized NAS to optimize the embedding components of recommender systems. Nevertheless, their approach cannot be directly deployed in streaming recommender systems, where the frequencies of users/items are unknown in advance and highly dynamic. The work [28] investigates dynamically searching embedding sizes for users and items based on their popularity in the streaming setting. Based on DARTS, they train a controller to assign weights to various candidate embedding sizes and calculate a combined embedding for each user and item. Their method suffers from the drawbacks of soft selection, such as the cold-start problem and excessive memory consumption.

5 CONCLUSION
Deep recommendation models typically use uniform and fixed embedding sizes for all users and items. This design is sub-optimal for real-world recommendation problems, where the frequencies of users and items are highly varied and change dynamically; it can also require much more storage. In this paper, we propose to dynamically search embedding sizes for different users and items. We develop a novel embedding size adjustment policy network to automatically search embedding sizes for users and items during the recommendation process. It applies hard selection on embedding sizes and overcomes the drawbacks of the soft selection adopted by previous works. We formulate the optimization of the policy network as a reinforcement learning problem and propose the corresponding optimization method. We evaluate the performance of


our framework on two real-world recommendation datasets, and the results demonstrate the superiority of our model over competitive baselines. Finally, we show that our model effectively alleviates the cold-start problem and reduces storage space significantly.

In this paper, we investigate how to dynamically adjust the embedding sizes for users/items based on their frequencies. Other information about a user/item may also be helpful for determining an appropriate embedding size, and we will attempt to incorporate such information. Moreover, the representation layer is only one part of the recommender system. As another future research direction, we are going to investigate how to automatically design the network structure of the inference layer to further boost the performance of the model.

ACKNOWLEDGMENTS
Haochen Liu, Xiangyu Zhao, and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS-1907704, IIS-1928278, IIS-1714741, IIS-1715940, IIS-1845081, and CNS-1815636.

REFERENCES
[1] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2017. SMASH: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344 (2017).
[2] Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A. Hasegawa-Johnson, and Thomas S. Huang. 2017. Streaming Recommender Systems. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017. 381–389. https://doi.org/10.1145/3038912.3052627
[3] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–344.
[4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS@RecSys 2016, Boston, MA, USA, September 15, 2016. 7–10. https://doi.org/10.1145/2988450.2988454
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[6] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[7] Manas R. Joglekar, Cong Li, Jay K. Adams, Pranav Khaitan, and Quoc V. Le. 2019. Neural Input Search for Large Scale Recommendation Models. CoRR abs/1907.04471 (2019). arXiv:1907.04471 http://arxiv.org/abs/1907.04471
[8] Manas R Joglekar, Cong Li, Jay K Adams, Pranav Khaitan, and Quoc V Le. 2019. Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471 (2019).
[9] Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi. 2020. Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems. CoRR abs/2002.08530 (2020). arXiv:2002.08530 https://arxiv.org/abs/2002.08530
[10] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
[11] PN Vijaya Kumar and V Raghunatha Reddy. 2014. A survey on recommender systems (RSS) and its applications. International Journal of Innovative Research in Computer and Communication Engineering 2, 8 (2014), 5254–5260.
[12] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable Architecture Search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=S1eYHoC5FX
[13] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2018. Neural architecture optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7827–7838.
[14] Hanh T. H. Nguyen, Martin Wistuba, Josif Grabocka, Lucas Rêgo Drumond, and Lars Schmidt-Thieme. 2017. Personalized Deep Learning for Tag Recommendation. In Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part I. 186–197. https://doi.org/10.1007/978-3-319-57454-7_15
[15] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018. 1349–1357. https://doi.org/10.1145/3178876.3186040
[16] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient Neural Architecture Search via Parameters Sharing. In International Conference on Machine Learning. 4095–4104.
[17] Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2019. Product-Based Neural Networks for User Response Prediction over Multi-Field Categorical Data. ACM Trans. Inf. Syst. 37, 1 (2019), 5:1–5:35. https://doi.org/10.1145/3233770
[18] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 1–35.
[19] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
[20] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2820–2828.
[21] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 515–524.
[22] Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017. What your images reveal: Exploiting visual contents for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web. 391–400.
[23] Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256. https://doi.org/10.1007/BF00992696
[24] Sai Wu, Weichao Ren, Chengchao Yu, Gang Chen, Dongxiang Zhang, and Jingbo Zhu. 2016. Personal recommendation using deep recurrent neural networks in NetEase. In 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016. 1218–1229. https://doi.org/10.1109/ICDE.2016.7498326
[25] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2018. SNAS: Stochastic neural architecture search. arXiv preprint arXiv:1812.09926 (2018).
[26] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1 (2019), 5:1–5:38. https://doi.org/10.1145/3285029
[27] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
[28] Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Jiliang Tang. 2020. AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations. CoRR abs/2002.11252 (2020). arXiv:2002.11252 https://arxiv.org/abs/2002.11252
[29] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep Reinforcement Learning for Page-wise Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 95–103.
[30] Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, and Jiliang Tang. 2019. Model-Based Reinforcement Learning for Whole-Chain Recommendations. arXiv preprint arXiv:1902.03987 (2019).
[31] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1040–1048.
[32] Xiangyu Zhao, Xudong Zheng, Xiwang Yang, Xiaobing Liu, and Jiliang Tang. 2020. Jointly Learning to Recommend and Advertise. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM.
[33] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[34] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.
