Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks

Ask Me Any Rating: A Content-basedRecommender System based onRecurrent Neural Networks7th Italian Information Retrieval WorkshopVenezia (Italy), May 30-31 2016

Cataldo Musto, Claudio Greco, Alessandro Suglia and Giovanni Semeraro

Work supported by the IBM Faculty Award ”Deep Learning to boost Cognitive Question Answering”Titan X GPU used for this research donated by the NVIDIA Corporation

1

Overview

1. Background

Content-based recommender systems

Neural network models

2. Research work

Ask me Any Rating (AMAR)

Experimental evaluation

3. Conclusions

Lesson-learnt

Vision

2

Background

Content-based recommender systems

Consists in matching up the attributes of a user profile withthe attributes of a content object (item) [1]

[1] P. Lops, M. De Gemmis, and G. Semeraro. “Content-based recommender systems:State of the art and trends”. In: Recommender systems handbook. Springer, 2011 3

Deep learning

Definition

Allows computational models that are composed ofmultiple processing layers to learn representations of data

with multiple levels of abstraction [2]

• Discovers intricate structure in large data sets by using thebackpropagation algorithm [3];

• Leads to progressively more abstract features at higher layers ofrepresentations;

• More abstract concepts are generally invariant to most localchanges of the input;

[2] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning”. In: Nature 521 (2015)[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning representations by

back-propagating errors”. In: Cognitive modeling (1988)

4

Recurrent Neural Networks

• Recurrent Neural Networks (RNN) are architectures suitable tomodel variable-length sequential data [4];

• The connections between their units may contain loops whichlet them consider past states in the learning process;

• Their roots are in the Dynamical System Theory in which thefollowing relation is true:

s(t) = f(s(t−1); x(t); θ)

where s(t) represents the current system state computed by ageneric function f evaluated on the previous state s(t−1), x(t)represents the current input and θ are the network parameters.

[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representationsby error propagation. Tech. rep. DTIC Document, 1985

5

RNN pros and cons

Pros

• Appropriate to represent sequential data;• A versatile framework which can be applied to different tasks;• Can learn short-term and long-term temporal dependencies.

Cons

• Vanishing/exploding gradient problem [5];• Difficulties to reach satisfying minima during the optimization of

the loss function;• Difficult to parallelize the training process.

[5] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies withgradient descent is difficult”. In: Neural Networks, IEEE Transactions on 5 (1994)

6

Long Short Term Memory (LSTM)

• A specific RNN introduced to solve the vanishing/explodinggradient problem;

• Each cell presents a complex structure which is more powerfulthan simple RNN cells.

Figure: LSTM architecture [6]

forget gate (f) considers thecurrent input and the previousstate to remove or preserve the

most appropriate information forthe given task

[6] A. Graves, A. Mohamed, and G. Hinton. “Speech recognition with deep recurrentneural networks”. In: Acoustics, Speech and Signal Processing (ICASSP), IEEE 2013

7





input gate (i) considers thecurrent input and the previous

state to determine how the inputinformation will be used to

update the state cell


7





output gate (o) considers thecurrent input, the previous state

and the updated state cell togenerate an appropriate output

for the given task


7

Research work

Ask Me Any Rating (AMAR)

“Mirror, mirror, here I stand.What is the fairest movie in the

land?”

• Inspired by a neural network modelused to solve Question Answeringtoy tasks [7];

• Name adapted from “Ask MeAnything” [8];

• Very simple Factoid QuestionAnswering system where userprofiles are questions and ratingsare answers.

[7] J. Weston et al. “Towards AI-Complete Question Answering: A Set of PrerequisiteToy Tasks”. In: CoRR abs/1502.05698 (2015)

[8] A. Kumar et al. “Ask Me Anything: Dynamic Memory Networks for NaturalLanguage Processing”. In: CoRR abs/1506.07285 (2015)

8


• Two different modules togenerate:

• User embedding• Item embedding

• User embedding associatedto a user identifier;

• Item embedding generatedfrom an item description;

• Concatenation of user anditem embeddings given to alogistic regression layer topredict the probability of a“like”.

b b bw1 w2 wm

User u Item description id

User LT Word LT

v(u)

v(id)

LSTM LSTM LSTM

v(w1) v(w2) v(wm)

h(w1) h(w2) h(wm)

Mean pooling layer

Concatenation layer

Logistic regression layer

9


User embedding

• An identifier u is associated to each user;• The identifier is given as input to a lookup table (User LT);• User LT converts it to a learnt user embedding v(u).

Item embedding

• Each word w1 . . .wm of the item description id is associated to aunique identifier specific of the item descriptions corpus;

• Words identifiers are given as input to a lookup table (Item LT);• Item LT converts them to learnt words embeddings v(wk);• Words embeddings v(wk) are sequentially passed through anRNN with LSTM cells (LSTM module);

• The LSTM module generates a latent representation h(wk) foreach word;

• A mean pooling layer averages the words representationsgenerating an item embedding v(id) for the item i.

10


“Like” probability estimation

• Item and user embeddings, v(id) and v(u), are concatenated in asingle representation;

• The resulting representation is used as feature for theprediction task;

• A Logistic regression layer is used to estimate the probability ofa “like” given by user u to a specific item i;

• The generated score is used to build a sorted list ofrecommended items for user u.

Optimization criterion

• The neural network is trained minimizing the BinaryCross-entropy loss function.

11

AMAR extended

• AMAR extended adds to the AMARarchitecture an additional modulefor items genres;

• An identifier gk is associated to eachitem genre;

• Genres identifier are given as inputto a lookup table (Genre LT);

• Genres LT converts them to learntgenres embeddings v(gk);

• A mean pooling layer averages thegenres representations generating agenres embedding v(ig).

g1 g2 gn

Item genres igj

Genre LT

v(u) v(id)

Mean pooling layer

Concatenation layer

Logistic regression layer

b b bv(g1) v(g2) v(gn)

v(ig)

Item description idUser u

b b b

12

Experimental protocol

• Datasets: Movielens 1M (ML1M) e DBbook;• Text preprocessing: tokenization and stopword removal;• Evaluation strategy: 5-fold cross validation for Movielens 1M,holdout for DBbook;

• Recommendation task: top-N recommendation leveragingbinary user feedback;

• Evaluation strategy for recommendation: TestRatings [9];• Metric: F1-measure evaluated at 5, 10 and 15.

[9] A. Bellogin, P. Castells, and I. Cantador. “Precision-oriented evaluation ofrecommender systems: an algorithmic comparison”. In: Proceedings of the fifth ACMconference on Recommender systems. 2011

13

ML1M

A film dataset created by the research group GroupLens of theUniversity of Minnesota which contains user ratings on a 5-starsscale.

Each rating has been binarized according to the following formula:

bin_rating(r) ={

1, if r ≥ 40, otherwise

#ratings 1000209#users 6040#item 3301avg ratings per user 31.423avg positive ratings per user 17.985avg negative ratings per user 13.439sparsity 0.95

14

DBbook

A book dataset released for the Linked open data-enabledrecommender systems: ESWC 2014 challenge [10].

It contains binary user preferences (e.g., I like it, I don’t like it).

#ratings 72371#users 6181#item 8170avg ratings per user 11.392avg positive ratings per user 6.727avg negative ratings per user 4.665sparsity 0.998

[10] T. Di Noia, I. Cantador, and V. C. Ostuni. “Linked open data-enabledrecommender systems: ESWC 2014 challenge on book recommendation”.In: Semantic Web Evaluation Challenge. Springer, 2014

15

Models configurations

Embedding-based recommenders

W2V Google News (W2V-news)• Method: SG• Embedding size: 300• Corpus: Google News

GloVe• Embedding size: 300• Corpus: Wikipedia 2014 +

Gigaword 5

Baseline recommenders

Item to item CF (I2I) *• Neighbours: 30, 50, 80

User to user CF (U2U) *• Neighbours: 30, 50, 80

SLIM with BPR-Opt (BPRSlim) *TF-IDF

Bayesian Personalized RankingMatrix Factorization (BPRMF) *

• Latent factors: 10, 30, 50Weighted Matrix FactorizationMethod (WRMF) *

• Latent factors: 10, 30, 50

* MyMediaLite implementations 16

Models configurations

AMAR

• Opt. method: RMSprop [11]• α: 0.9• Learning rate: 0.001

• Epochs: 25;• User embedding size: 10;• Item embedding size: 10;• LSTM output size: 10;• Batch size:

• ML1M: 1536• DBbook: 512

AMAR extended

• Opt. method: RMSprop• α: 0.9• Learning rate: 0.001

• Epochs: 25;• User embedding size: 10;• Item embedding size: 10;• Genre embedding size: 10;• LSTM output size: 10;• Batch size:

• ML1M: 1536• DBbook: 512

[11] T. Tieleman and G. E. Hinton. “rmsprop”. In: COURSERA: Neural Networks forMachine Learning Lecture 6.5 (2012) 17

DBbook results

0.662 0.662

0.6550.656

0.640.639

0.631

0.636

0.632

0.662

0.62

0.62

0.63

0.63

0.64

0.64

0.65

0.65

0.66

0.66

0.67

AMAR AMARextended

GloVe W2V-News I2I-30 U2U-30 BPRMF-30 WRMF-50 BPRSlim TF-IDF

F1@

10

RECOMMENDER CONFIGURATIONS

Differences statistically significant according to Wilcoxon test (ρ ≤ 0.05)

18

ML1M results

0.641 0.644

0.575

0.587

0.527 0.525 0.524 0.525

0.548

0.59

0.40

0.45

0.50

0.55

0.60

0.65

AMAR AMAR extended GloVe W2V-News I2I-30 U2U-30 BPRMF-30 WRMF-50 BPRSlim TF-IDF

F1@

10

RECOMMENDER CONFIGURATIONS

Only differences between U2U and GloVe, BPRSlim and GloVe, GloVe and Word2vecare not statistically significant according to Wilcoxon test (ρ ≤ 0.05)

19

Conclusions

AMAR pros and cons

Pros

• High improvement on ML1M;• Able to learn more suitable item and user representations for

the recommendation task;• Item and user embeddings are not generated using a simple

mean, but they are adapted during training.

Cons

• It does not deal well with very sparse datasets:• Small improvement on DBbook

• High training times:• DBbook: 50 minutes per epoch• ML1M: 90 minutes per epoch

20

AMAR Improvements

Optimization

• Use alternative training methods and regularization techniques;• Use pretrained word embeddings;• More appropriate cost functions for top-N recommendation;• Increase embedding dimensions.

Architecture

• Item modeling may be improved by using different neuralnetwork architectures;

• Classification step may be done by using deeper fully connectedlayers.

Additional features

Leverage important data silos to enrich item representations:

• Linked Open Data;• Web and social media. 21

Thanks for your attention

• Design of recommender systemsusing deep neural networks;

• Experimental evaluation onwell-known datasets on thetop-N recommendation task;

• Higher performance using deepmodels than using shallowmodels.

Alessandro [email protected]

Claudio [email protected]

22

[email protected]

[email protected]

Technical details(Warning: for geeks only)

Cross entropy

DefinitionGiven two probability distributions over the same underlying set ofevents, p and q, it measures the average number of bits needed toidentify an event drawn from a set of possibilities, if a codingscheme is used based on an “unnatural” probability distribution q,rather than the “true” distribution p.

Given discrete probability distributions p and q, the cross entropy isdefined as follows:

H(p,q) = −∑xp(x) logq(x)

23

RNN

Given an input vector x(t), bias vectors b, c and weight matrices U, Vand W, a forward step of an RNN neural network is computed in thisway:

at = b+Wst−1 + Uxtst = tanh(at)ot = c+ Vstpt = softmax(ot)

In this case, the activation function are the hyperbolic tangent(tanh) for the hidden layer and the multinomial logistic function(softmax) for the output layer.

24

LSTM

The information flow in an LSTM module is much more complex thanthe one in an RNN. The architecture used in this work uses thefollowing equations, presented in [6]:

it = σ(Wxixt +Whiht−1 +Wcict−1 + bi)ft = σ(Wxfxt +Whfht−1 +Wcfct−1 + bf)ct = ftct−1 + it tanh(Wxcxt +Whcht−1 + bc)ot = σ(Wxoxt +Whoht−1 +Wcoct + bo)ht = ot tanh(ct)

where σ is the logistic sigmoid function, and i, f, o and c arerespectively the input gate, forget gate,output gate and cellactivation vectors, all of which are the same size as the hiddenvector h.

25

Corpus stats

Google News• # tokens: 6B• Vocabulary size: 40K• # matched words:

• DBbook: 44636 (41.52%)• ML1M: 35150 (49.13%)

GloVe• # tokens: 100B• Vocabulary size: 3M• # matched words:

• DBbook: 65013 (60.48%)• ML1M: 49893 (69.74%)

26

References

[1] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro.“Content-based recommender systems: State of the art andtrends”. In: Recommender systems handbook. Springer, 2011.

[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deeplearning”. In: Nature 521 (2015).

[3] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.“Learning representations by back-propagating errors”. In:Cognitive modeling 5 (1988).

[4] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.Learning internal representations by error propagation.Tech. rep. DTIC Document, 1985.

[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. “Learninglong-term dependencies with gradient descent is difficult”. In:Neural Networks, IEEE Transactions on 5 (1994).

[6] Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton.“Speech recognition with deep recurrent neural networks”. In:

26

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEEInternational Conference on. IEEE. 2013.

[7] Jason Weston et al. “Towards AI-Complete Question Answering:A Set of Prerequisite Toy Tasks”. In: CoRR abs/1502.05698 (2015).

[8] Ankit Kumar et al. “Ask Me Anything: Dynamic MemoryNetworks for Natural Language Processing”. In: CoRRabs/1506.07285 (2015).

[9] Alejandro Bellogin, Pablo Castells, and Ivan Cantador.“Precision-oriented evaluation of recommender systems: analgorithmic comparison”. In: Proceedings of the fifth ACMconference on Recommender systems. 2011.

[10] Tommaso Di Noia, Iván Cantador, and Vito Claudio Ostuni.“Linked open data-enabled recommender systems: ESWC 2014challenge on book recommendation”. In: Semantic WebEvaluation Challenge. Springer, 2014.

26

[11] Tijmen Tieleman and Geoffrey E. Hinton. “rmsprop”. In:COURSERA: Neural Networks for Machine Learning Lecture 6.5(2012).

26

Software

Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks