
Page 1: The Duet Model

Learning to Match Using Local and Distributed Representations of Text for Web Search

Nick Craswell, Microsoft, Bellevue, USA ([email protected])

Fernando Diaz, Spotify*, New York, USA ([email protected]) *work done while at Microsoft

Bhaskar Mitra, Microsoft and UCL, Cambridge, UK ([email protected])

Page 2: The Duet model

The document ranking task

Given a query, rank documents according to relevance

The query text has few terms

The document representation can be

long (e.g., body text) or short (e.g., title)

query

ranked results

search engine w/ an

index of retrievable items

Page 3: The Duet model

This paper is focused on ranking documents

based on their long body text

Page 5: The Duet model

But few for long document ranking…

(Guo et al., 2016)

(Salakhutdinov and Hinton, 2009)

Page 6: The Duet model

Challenges in short vs. long text retrieval

Short-text

Vocabulary mismatch is a more serious problem

Long-text

Documents contain mixture of many topics

Matches in different parts of the document non-uniformly important

Term proximity is important

Page 7: The Duet model

The “black swans” of Information Retrieval

The term "black swan" originally referred to impossible events. In 1697, Dutch explorers encountered black swans for the first time in Western Australia; since then, the term has been used to refer to surprisingly rare events.

In IR, many query terms and intents are

never observed in the training data

Exact matching is effective in making the

IR model robust to rare events

Page 8: The Duet model

Desiderata of document ranking

Exact matching

Important if query term is rare / fresh

Frequency and positions of matches

good indicators of relevance

Term proximity is important

Inexact matching

Synonymy relationships

united states president ↔ Obama

Evidence for document aboutness

Documents about Australia likely to contain

related terms like Sydney and koala

Proximity and position is important

Page 9: The Duet model

Different text representations for matching

Local representation

Terms are considered distinct entities

Term representation is local (one-hot vectors)

Matching is exact (term-level)

Distributed representation

Represent text as dense vectors (embeddings)

Inexact matching in the embedding space

(Figure: local one-hot representation vs. distributed representation)
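As a toy illustration of the contrast above (the vocabulary, embedding dimension, and random vectors below are illustrative, not from the paper):

```python
import numpy as np

# Local (one-hot) representation: every term gets its own axis.
vocab = {"united": 0, "states": 1, "president": 2, "obama": 3}

def one_hot(term):
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

# Exact matching: the dot product is 1 only for identical terms.
assert one_hot("president") @ one_hot("president") == 1.0
assert one_hot("president") @ one_hot("obama") == 0.0

# Distributed representation: dense vectors, so matching is graded.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 8))  # toy 8-dim embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inexact matching: a trained model places related terms close together;
# here the embeddings are random, so the value is just some score in [-1, 1].
sim = cosine(embedding[vocab["president"]], embedding[vocab["obama"]])
```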

Page 10: The Duet model

A tale of two queries

“pekarovic land company”

Hard to learn a good representation for the rare term pekarovic

But easy to estimate relevance based

on patterns of exact matches

Proposal: Learn a neural model to

estimate relevance from patterns of

exact matches

“what channel are the seahawks on today”

Target document likely contains ESPN

or Sky Sports instead of channel

An embedding model can associate

ESPN in the document with channel in the query

Proposal: Learn embeddings of text

and match query with document in

the embedding space

The Duet Architecture

Use a neural network to model both functions and learn their parameters jointly

Page 11: The Duet model

The Duet architecture

Linear combination of two models

trained jointly on labelled query-

document pairs

Local model operates on lexical

interaction matrix

Distributed model projects n-graph

vectors of text into an embedding

space and then estimates match

(Architecture diagram)

Local model: query text and doc text → query and doc term vectors → interaction matrix → fully connected layers for matching

Distributed model: query text and doc text → query and doc embeddings → Hadamard product → fully connected layers for matching

The outputs of the two models are summed.
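The combination itself is just a sum of the two sub-model scores. A minimal sketch, with stand-in scoring functions (the function names and the trivial scoring logic below are hypothetical placeholders, not the paper's networks):

```python
def local_score(query, doc):
    # Stand-in for the local sub-network: fraction of query terms
    # that appear exactly in the document.
    terms = query.lower().split()
    doc_terms = set(doc.lower().split())
    return sum(t in doc_terms for t in terms) / len(terms)

def distributed_score(query, doc):
    # Placeholder for the distributed sub-network, which matches
    # learned query/document embeddings in embedding space.
    return 0.0

def duet_score(query, doc):
    # Duet: a linear combination (sum) of the two sub-model scores;
    # in the real model both parts are trained jointly end to end.
    return local_score(query, doc) + distributed_score(query, doc)
```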

Page 12: The Duet model

Local model

Page 13: The Duet model

Local model: term interaction matrix

X[i,j] = 1 if q_i = d_j, 0 otherwise

In relevant documents,

→Many matches, typically clustered

→Matches localized early in document

→Matches for all query terms

→In-order (phrasal) matches
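The binary interaction matrix defined above can be computed directly from the term sequences:

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms):
    """Binary matrix X with X[i, j] = 1 iff query term i equals doc term j."""
    return np.array([[1.0 if q == d else 0.0 for d in doc_terms]
                     for q in query_terms])

X = interaction_matrix(
    ["united", "states", "president"],
    ["the", "president", "of", "the", "united", "states"])
# One row per query term, one column per document position;
# clusters and diagonals of 1s indicate localized and phrasal matches.
```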

Page 14: The Duet model

Local model: estimating relevance

← document words →

Convolve using window of size 𝑛𝑑 × 1

Each window instance compares a query term w/

whole document

Fully connected layers aggregate evidence across query terms and can model phrasal matches
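The convolution step above can be sketched in NumPy. With document terms on the rows, an n_d x 1 window spans a whole column, i.e., one query term against the entire document (the shapes are toy values and the weights are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_d, n_filters = 3, 6, 4

# Toy interaction matrix: doc terms on rows, query terms on columns.
X = rng.integers(0, 2, size=(n_d, n_q)).astype(float)

# Convolution with an (n_d x 1) window: each filter dots one full column,
# comparing a single query term against the whole document.
filters = rng.normal(size=(n_filters, n_d))
features = filters @ X  # shape (n_filters, n_q): one feature vector per query term

# A fully connected layer then aggregates evidence across query terms.
w = rng.normal(size=features.size)
score = float(np.tanh(features.ravel() @ w))
```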

Page 15: The Duet model

Distributed model

Page 16: The Duet model

Distributed model: input representation

dogs → [ d , o , g , s , #d , do , og , gs , s# , #do , dog , ogs , gs#, #dog, dogs, ogs#, #dogs, dogs# ]

(we consider only the 2K most popular n-graphs for encoding)

Example sentence: "d o g s   h a v e   o w n e r s   c a t s   h a v e   s t a f f"

Each word is n-graph encoded into a vector of 2K channels; the per-word vectors are concatenated into a [words x channels] matrix.
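The n-graph decomposition shown for "dogs" can be reproduced with a few lines of Python (unigrams come from the bare word; longer n-graphs from the '#'-padded word, matching the example above):

```python
def n_graphs(word, max_n=5):
    """Character n-graphs of a word, with '#' marking word boundaries."""
    padded = "#" + word + "#"
    grams = list(word)  # n = 1: plain characters, no boundary markers
    for n in range(2, max_n + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(n_graphs("dogs"))
# → ['d', 'o', 'g', 's', '#d', 'do', 'og', 'gs', 's#', '#do', 'dog',
#    'ogs', 'gs#', '#dog', 'dogs', 'ogs#', '#dogs', 'dogs#']
```

In the model, each word would then be represented as a count vector over the 2K most popular n-graphs, giving the [words x channels] input matrix.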

Page 17: The Duet model

(Diagram: convolution and pooling over the query and document n-graph matrices produce a query embedding and per-window document embeddings; a Hadamard product matches query against document, followed by fully connected layers.)

Distributed model: estimating relevance

Convolve over query and

document terms

Match query with moving

windows over document

Learn text embeddings

specifically for the task

Matching happens in

embedding space

* Network architecture slightly simplified for visualization; refer to the paper for exact details
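The matching step can be sketched as follows (toy shapes and random vectors standing in for the learned, convolution-derived embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_windows = 8, 5

# Stand-ins for learned embeddings: one vector for the query,
# one per moving window over the document.
q_emb = rng.normal(size=dim)
d_embs = rng.normal(size=(n_windows, dim))

# Hadamard (elementwise) product matches the query against each window,
# so matching happens dimension by dimension in the embedding space.
match = d_embs * q_emb  # shape (n_windows, dim)

# Fully connected layer collapses the match tensor to a relevance score.
w = rng.normal(size=match.size)
score = float(match.ravel() @ w)
```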

Page 18: The Duet model

Putting the two models together…

Page 19: The Duet model

The Duet model

Training sample: (Q, D+, D1-, D2-, D3-, D4-)

D+ = document rated Excellent or Good

D- = document rated 2 or more levels worse than D+

Optimize cross-entropy loss

Implemented using CNTK (GitHub link)
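The objective can be sketched as a softmax over the model's scores for the positive and negative documents, with the loss being the negative log-probability of the positive (a minimal NumPy sketch, not the CNTK implementation):

```python
import numpy as np

def duet_loss(score_pos, scores_neg):
    """Cross-entropy loss over one positive and several negative documents."""
    scores = np.array([score_pos] + list(scores_neg))
    scores -= scores.max()  # subtract max for numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return float(-log_softmax[0])  # -log P(D+ | Q)

# Toy scores for (D+, D1-, D2-, D3-, D4-): as training drives the
# positive score above the negatives, the loss falls toward 0.
loss = duet_loss(5.0, [1.0, 0.5, 0.0, -1.0])
```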

Page 20: The Duet model

Data

Need large-scale training data

(labels or clicks)

We use Bing human-labelled data for both training and testing

Page 21: The Duet model

Results

Key finding: Duet performs significantly better than local and distributed

models trained individually

Page 22: The Duet model

Random negatives vs. judged negatives

Key finding: training with judged bad documents as negatives is significantly better than training with random negatives

Page 23: The Duet model

Local vs. distributed model

Key finding: the local and distributed models perform better on different query segments, but the combination is always better

Page 24: The Duet model

Effect of training data volume

Key finding: a large quantity of training data is necessary for learning good representations, but is less impactful for training the local model

Page 25: The Duet model

Term importance

Local model

Only query terms have an impact

Earlier occurrences have bigger impact

Query: united states president

Visualizing impact of dropping terms on model score

Page 26: The Duet model

Term importance

Distributed model

Non-query terms (e.g., Obama and

federal) have a positive impact on the score

Common words like 'the' and 'of' are

probably good indicators of the

well-formedness of the content

Query: united states president

Visualizing impact of dropping terms on model score

Page 27: The Duet model

Types of models

If we classify models by query-level performance, there is a clear clustering of lexical (local) and semantic (distributed) models

Page 28: The Duet model

Duet on other IR tasks

Promising early results on TREC

2017 Complex Answer Retrieval

(TREC-CAR)

Duet performs significantly

better when trained on large

data (~32 million samples)

(PAPER UNDER REVIEW)

Page 29: The Duet model

Summary

Both exact and inexact matching are important for IR

Deep neural networks can be used to model both types of matching

Local model more effective for queries containing rare terms

Distributed model benefits from training on large datasets

Combine local and distributed model to achieve state-of-the-art performance

Get the model:

https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb