
Page 1: The Duet Model

Learning to Match Using Local and Distributed Representations of Text for Web Search

Nick Craswell, Microsoft, Bellevue, USA ([email protected])

Fernando Diaz, Spotify*, New York, USA ([email protected]) *work done while at Microsoft

Bhaskar Mitra, Microsoft and UCL, Cambridge, UK ([email protected])

Page 2: The Duet model

The document ranking task

Given a query, rank documents according to relevance

The query text has few terms

The document representation can be

long (e.g., body text) or short (e.g., title)

query

ranked results

search engine w/ an

index of retrievable items

Page 3: The Duet model

This paper is focused on ranking documents

based on their long body text

Page 5: The Duet model

But few for long document ranking…

(Guo et al., 2016)

(Salakhutdinov and Hinton, 2009)

Page 6: The Duet model

Challenges in short vs. long text retrieval

Short-text

Vocabulary mismatch is a more serious problem

Long-text

Documents contain mixture of many topics

Matches in different parts of the document non-uniformly important

Term proximity is important

Page 7: The Duet model

The “black swans” of Information Retrieval

The term "black swan" originally referred to impossible events. In 1697, Dutch explorers encountered black swans for the first time in Western Australia; since then, the term has been used to refer to surprisingly rare events.

In IR, many query terms and intents are

never observed in the training data

Exact matching is effective in making the

IR model robust to rare events

Page 8: The Duet model

Desiderata of document ranking

Exact matching

Important if query term is rare / fresh

Frequency and positions of matches

good indicators of relevance

Term proximity is important

Inexact matching

Synonymy relationships

united states president ↔ Obama

Evidence for document aboutness

Documents about Australia likely to contain

related terms like Sydney and koala

Proximity and position is important

Page 9: The Duet model

Different text representations for matching

Local representation

Terms are considered distinct entities

Term representation is local (one-hot vectors)

Matching is exact (term-level)

Distributed representation

Represent text as dense vectors (embeddings)

Inexact matching in the embedding space

(Figure: local one-hot representation vs. distributed representation)
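As a toy illustration of the contrast above (the vocabulary, embedding dimension, and random vectors below are illustrative, not from the paper):

```python
import numpy as np

# Local (one-hot) representation: every term gets its own axis.
vocab = {"united": 0, "states": 1, "president": 2, "obama": 3}

def one_hot(term):
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

# Exact matching: the dot product is 1 only for identical terms.
assert one_hot("president") @ one_hot("president") == 1.0
assert one_hot("president") @ one_hot("obama") == 0.0

# Distributed representation: dense vectors, so matching is graded.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 8))  # toy 8-dim embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inexact matching: a trained model places related terms close together;
# here the embeddings are random, so the value is just some score in [-1, 1].
sim = cosine(embedding[vocab["president"]], embedding[vocab["obama"]])
```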

Page 10: The Duet model

A tale of two queries

“pekarovic land company”

Hard to learn a good representation for the rare term pekarovic

But easy to estimate relevance based

on patterns of exact matches

Proposal: Learn a neural model to

estimate relevance from patterns of

exact matches

“what channel are the seahawks on today”

Target document likely contains ESPN

or Sky Sports instead of channel

An embedding model can associate

ESPN in the document with channel in the query

Proposal: Learn embeddings of text

and match query with document in

the embedding space

The Duet Architecture

Use a neural network to model both functions and learn their parameters jointly

Page 11: The Duet model

The Duet architecture

Linear combination of two models

trained jointly on labelled query-

document pairs

Local model operates on lexical

interaction matrix

Distributed model projects n-graph

vectors of text into an embedding

space and then estimates match

(Architecture diagram)

Local model: query text and doc text → query and doc term vectors → interaction matrix → fully connected layers for matching

Distributed model: query text and doc text → query and doc embeddings → Hadamard product → fully connected layers for matching

The outputs of the two models are summed.
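The combination itself is just a sum of the two sub-model scores. A minimal sketch, with stand-in scoring functions (the function names and the trivial scoring logic below are hypothetical placeholders, not the paper's networks):

```python
def local_score(query, doc):
    # Stand-in for the local sub-network: fraction of query terms
    # that appear exactly in the document.
    terms = query.lower().split()
    doc_terms = set(doc.lower().split())
    return sum(t in doc_terms for t in terms) / len(terms)

def distributed_score(query, doc):
    # Placeholder for the distributed sub-network, which matches
    # learned query/document embeddings in embedding space.
    return 0.0

def duet_score(query, doc):
    # Duet: a linear combination (sum) of the two sub-model scores;
    # in the real model both parts are trained jointly end to end.
    return local_score(query, doc) + distributed_score(query, doc)
```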

Page 12: The Duet model

Local model

Page 13: The Duet model

Local model: term interaction matrix

X[i,j] = 1 if q_i = d_j, 0 otherwise

In relevant documents,

→Many matches, typically clustered

→Matches localized early in document

→Matches for all query terms

→In-order (phrasal) matches
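The binary interaction matrix defined above can be computed directly from the term sequences:

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms):
    """Binary matrix X with X[i, j] = 1 iff query term i equals doc term j."""
    return np.array([[1.0 if q == d else 0.0 for d in doc_terms]
                     for q in query_terms])

X = interaction_matrix(
    ["united", "states", "president"],
    ["the", "president", "of", "the", "united", "states"])
# One row per query term, one column per document position;
# clusters and diagonals of 1s indicate localized and phrasal matches.
```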

Page 14: The Duet model

Local model: estimating relevance

← document words →

Convolve using window of size 𝑛𝑑 × 1

Each window instance compares a query term w/

whole document

Fully connected layers aggregate evidence across query terms and can model phrasal matches
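The convolution step above can be sketched in NumPy. With document terms on the rows, an n_d x 1 window spans a whole column, i.e., one query term against the entire document (the shapes are toy values and the weights are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_d, n_filters = 3, 6, 4

# Toy interaction matrix: doc terms on rows, query terms on columns.
X = rng.integers(0, 2, size=(n_d, n_q)).astype(float)

# Convolution with an (n_d x 1) window: each filter dots one full column,
# comparing a single query term against the whole document.
filters = rng.normal(size=(n_filters, n_d))
features = filters @ X  # shape (n_filters, n_q): one feature vector per query term

# A fully connected layer then aggregates evidence across query terms.
w = rng.normal(size=features.size)
score = float(np.tanh(features.ravel() @ w))
```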

Page 15: The Duet model

Distributed model

Page 16: The Duet model

Distributed model: input representation

dogs → [ d , o , g , s , #d , do , og , gs , s# , #do , dog , ogs , gs#, #dog, dogs, ogs#, #dogs, dogs# ]

(we consider only the 2K most popular n-graphs for encoding)

Example sentence: "d o g s   h a v e   o w n e r s   c a t s   h a v e   s t a f f"

Each word is n-graph encoded into a vector of 2K channels; the per-word vectors are concatenated into a [words x channels] matrix.
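The n-graph decomposition shown for "dogs" can be reproduced with a few lines of Python (unigrams come from the bare word; longer n-graphs from the '#'-padded word, matching the example above):

```python
def n_graphs(word, max_n=5):
    """Character n-graphs of a word, with '#' marking word boundaries."""
    padded = "#" + word + "#"
    grams = list(word)  # n = 1: plain characters, no boundary markers
    for n in range(2, max_n + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(n_graphs("dogs"))
# → ['d', 'o', 'g', 's', '#d', 'do', 'og', 'gs', 's#', '#do', 'dog',
#    'ogs', 'gs#', '#dog', 'dogs', 'ogs#', '#dogs', 'dogs#']
```

In the model, each word would then be represented as a count vector over the 2K most popular n-graphs, giving the [words x channels] input matrix.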

Page 17: The Duet model

(Diagram: convolution and pooling over the query and document n-graph matrices produce a query embedding and per-window document embeddings; a Hadamard product matches query against document, followed by fully connected layers.)

Distributed model: estimating relevance

Convolve over query and

document terms

Match query with moving

windows over document

Learn text embeddings

specifically for the task

Matching happens in

embedding space

* Network architecture slightly simplified for visualization; refer to the paper for exact details
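The matching step can be sketched as follows (toy shapes and random vectors standing in for the learned, convolution-derived embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_windows = 8, 5

# Stand-ins for learned embeddings: one vector for the query,
# one per moving window over the document.
q_emb = rng.normal(size=dim)
d_embs = rng.normal(size=(n_windows, dim))

# Hadamard (elementwise) product matches the query against each window,
# so matching happens dimension by dimension in the embedding space.
match = d_embs * q_emb  # shape (n_windows, dim)

# Fully connected layer collapses the match tensor to a relevance score.
w = rng.normal(size=match.size)
score = float(match.ravel() @ w)
```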

Page 18: The Duet model

Putting the two models together…

Page 19: The Duet model

The Duet model

Training sample: (Q, D+, D1-, D2-, D3-, D4-)

D+ = document rated Excellent or Good

D- = document rated 2 or more levels worse than D+

Optimize cross-entropy loss

Implemented using CNTK (GitHub link)
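The objective can be sketched as a softmax over the model's scores for the positive and negative documents, with the loss being the negative log-probability of the positive (a minimal NumPy sketch, not the CNTK implementation):

```python
import numpy as np

def duet_loss(score_pos, scores_neg):
    """Cross-entropy loss over one positive and several negative documents."""
    scores = np.array([score_pos] + list(scores_neg))
    scores -= scores.max()  # subtract max for numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return float(-log_softmax[0])  # -log P(D+ | Q)

# Toy scores for (D+, D1-, D2-, D3-, D4-): as training drives the
# positive score above the negatives, the loss falls toward 0.
loss = duet_loss(5.0, [1.0, 0.5, 0.0, -1.0])
```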

Page 20: The Duet model

Data

Need large-scale training data

(labels or clicks)

We use Bing human-labelled data for both training and testing

Page 21: The Duet model

Results

Key finding: Duet performs significantly better than local and distributed

models trained individually

Page 22: The Duet model

Random negatives vs. judged negatives

Key finding: training with judged bad documents as negatives is significantly better than training with random negatives

Page 23: The Duet model

Local vs. distributed model

Key finding: the local and distributed models perform better on different query segments, but the combination is always better

Page 24: The Duet model

Effect of training data volume

Key finding: a large quantity of training data is necessary for learning good representations, but is less impactful for training the local model

Page 25: The Duet model

Term importance

Local model

Only query terms have an impact

Earlier occurrences have bigger impact

Query: united states president

Visualizing impact of dropping terms on model score

Page 26: The Duet model

Term importance

Distributed model

Non-query terms (e.g., Obama and

federal) have a positive impact on the score

Common words like 'the' and 'of' are

probably good indicators of the

well-formedness of the content

Query: united states president

Visualizing impact of dropping terms on model score

Page 27: The Duet model

Types of models

If we classify models by query-level performance, there is a clear clustering of lexical (local) and semantic (distributed) models

Page 28: The Duet model

Duet on other IR tasks

Promising early results on TREC

2017 Complex Answer Retrieval

(TREC-CAR)

Duet performs significantly

better when trained on large

data (~32 million samples)

(PAPER UNDER REVIEW)

Page 29: The Duet model

Summary

Both exact and inexact matching are important for IR

Deep neural networks can be used to model both types of matching

Local model more effective for queries containing rare terms

Distributed model benefits from training on large datasets

Combine local and distributed model to achieve state-of-the-art performance

Get the model:

https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb