Search, Discovery and Questions at Quora

Search, Discovery and the Question namespace

Nikhil Dandekar (@nikhilbd) Sandeep Goyal

5/10/2016

About us

Nikhil Dandekar

● Engineering Manager: Search, Questions and Growth at Quora

● Previously: Foursquare, Bing

● Background in Search, Data, Machine Learning

Sandeep Goyal

● Software Engineer: Search, Questions and Growth at Quora.

● Previously: Playdom, Yahoo

● Background in Growth, Systems

Quora’s Mission

“To share and grow the world’s knowledge”

● Millions of questions & answers

● Millions of users

● Over a million topics

● ...

Agenda

● The core data

● Search & the question namespace

● Discovery - Feed ranking

● Duplicate Questions deep dive

The Core Data

Lots of data relations

Complex network propagation effects

Importance of topics & semantics

Search & the Question namespace

How we think of search

Ranking - Search ranking

● Match user queries to Quora entities

● Corpus: Quora questions, answers, topics, users, blogs etc.

● Ranking: Traditional IR scores (e.g. BM25), hand-tuned or ML-ranking

● Focus on long-term satisfaction

○ If a question exists, but the answer is unsatisfactory, let the user “Request Answers” for the question
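As an illustration of the "traditional IR scores" mentioned above, here is a minimal sketch of Okapi BM25 over whitespace-tokenized question text. The function name `bm25_scores`, the toy corpus, and the parameter defaults `k1=1.5`, `b=0.75` are assumptions made for this example; the slides do not describe Quora's actual scoring code.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common textbook defaults, not values taken from the talk.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus of question titles.
docs = [
    "how do i learn machine learning".split(),
    "what is the best way to learn python".split(),
    "why is the sky blue".split(),
]
scores = bm25_scores("learn python".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In a hand-tuned or ML-ranked setup, a score like this would typically be one feature among several rather than the final ranking.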

Question Asking

Goal: Find the best people to answer a question

● Understand the question

● Find people who can best answer the question

● “Ask to Answer”: Route the question to these people

● Either manual or automated A2A

Related Questions

• Given interest in question A (source), what other questions will be interesting?

• Not only about similarity, but also “interestingness”

• Features such as:

• Textual

• Co-visit

• Topics

• …

• Important for logged-out use case
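One way to picture the co-visit feature listed above is to count how often two questions are viewed in the same session. The sketch below is only an illustration: the session format, the helper names `covisit_counts` and `related`, and the idea of ranking purely by raw counts are assumptions, not the production approach (which, per the slide, also weighs textual and topic features as well as interestingness).

```python
from collections import Counter
from itertools import combinations


def covisit_counts(sessions):
    """Count how many sessions contain each unordered pair of question ids."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def related(source, counts, top_k=3):
    """Return up to top_k questions most often co-visited with source."""
    scored = Counter()
    for (a, b), n in counts.items():
        if a == source:
            scored[b] += n
        elif b == source:
            scored[a] += n
    return [qid for qid, _ in scored.most_common(top_k)]


# Hypothetical sessions: each is the list of question ids one visitor viewed.
sessions = [["q1", "q2", "q3"], ["q1", "q2"], ["q2", "q4"]]
counts = covisit_counts(sessions)
candidates = related("q1", counts)
```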

Duplicate Questions

• Important issue for Quora

• Want to make sure we don’t disperse knowledge about the same question across duplicates

• Solution: binary classifier trained with labelled data

• Features

• Textual vector space models

• Usage-based features

• etc.

• More on this later...

Discovery - Feed ranking

Ranking - Feed

• Goal: Present the most interesting stories for a user at a given time

• Interesting = topical relevance + social relevance + timeliness

• Stories = questions + answers

• Relevance-ordered (vs. time-ordered) feed = big gains in engagement

• Challenges:

• potentially many candidate stories

• real-time ranking

• optimize for relevance

• Use Machine Learning for feed ranking

Feed dataset: impression logs

[Figure: example feed impressions with the actions logged per story: click, upvote, downvote, expand, share, answer, pass, follow]

What is relevance?

● Value of showing a story to a user, e.g. weighted sum of actions:

v = ∑_a v_a · 1{y_a = 1}

● Goal: predict this value for new stories. 2 possible approaches:

○ predict value directly

v_pred = f(x)

■ pros: single regression model

■ cons: target can be ambiguous; action models and action values are coupled

○ predict probabilities for each action, then compute expected value:

v_pred = E[V | x] = ∑_a v_a · p(a | x)

■ pros: better use of supervised signal, decouples action models from action values

■ cons: more costly, one classifier per action
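The expected-value approach (predict a probability per action, then weight each by its action value) can be sketched in a few lines. The action weights in `ACTION_VALUES` and the per-story probabilities are made-up numbers for illustration; in practice the probabilities would come from one trained classifier per action.

```python
# Hypothetical action values v_a; the talk does not give real weights.
ACTION_VALUES = {"click": 0.2, "upvote": 1.0, "share": 2.0, "downvote": -3.0}


def expected_value(action_probs, action_values=ACTION_VALUES):
    """Compute v_pred = sum over actions a of v_a * p(a | x)."""
    return sum(action_values[a] * p for a, p in action_probs.items())


def rank_stories(candidates):
    """Order candidate stories by predicted value, highest first."""
    return sorted(candidates, key=lambda s: expected_value(s["probs"]), reverse=True)


# Hypothetical classifier outputs p(a | x) for two candidate stories.
stories = [
    {"id": "B", "probs": {"click": 0.3, "upvote": 0.05, "share": 0.02, "downvote": 0.1}},
    {"id": "A", "probs": {"click": 0.5, "upvote": 0.1, "share": 0.0, "downvote": 0.01}},
]
ranked_ids = [s["id"] for s in rank_stories(stories)]
```

Because the weights live outside the classifiers, they can be retuned without retraining any action model, which is the decoupling benefit listed under the pros.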

Feature engineering

● Essential for getting good ranking

● Better if updated in real-time (more reactive)

● Main sets of features:

○ user (e.g. age, country, recent activity)

○ story (e.g. popularity, trendiness, quality)

○ interactions between the two (e.g. topic or author affinity)

Models

● Linear

○ simple, fast to train

○ need manual non-linear transforms for richer representation (buckets, ngrams)

● Decision trees

○ learn non-linear representations

● Tree ensembles

○ Random forests

○ Gradient boosted decision trees

● In-house C++ training code, third-party libraries for prototyping new models
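To make the "buckets" transform for linear models concrete, here is a small sketch that one-hot encodes a numeric feature by interval, so that a linear model can learn a separate weight per bucket. The boundaries, the weights, and the feature (a story's view count) are hypothetical.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """One-hot encode value by the interval of sorted boundaries it falls into."""
    one_hot = [0] * (len(boundaries) + 1)
    one_hot[bisect_right(boundaries, value)] = 1
    return one_hot


def linear_score(features, weights):
    """Plain dot product, as a linear model would compute it."""
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical buckets for a story's view count: <10, 10-99, 100-999, >=1000.
boundaries = [10, 100, 1000]
# Hypothetical learned weights: the response to views is not monotonic.
weights = [-0.5, 0.3, 0.8, 0.1]

low = bucketize(5, boundaries)
edge = bucketize(10, boundaries)
high = bucketize(5000, boundaries)
score = linear_score(bucketize(250, boundaries), weights)
```

Tree-based models learn such thresholds on their own, which is the contrast the slide draws between linear models and decision trees.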

Scalability: feed backend system

[Diagram: Requests from Web (python) → Aggregator 1, Aggregator 2, Aggregator 3, ... → Leaf 1, Leaf 2, Leaf 3, ...; labels: user_id, object_id]
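The slide shows only the aggregator/leaf topology, not how requests flow through it. A common pattern for this shape is scatter-gather: an aggregator fans a request out to every leaf, each leaf returns its local top-k, and the aggregator merges the partial results. The sketch below assumes that pattern; the `Leaf` and `Aggregator` classes, the static scores, and the thread pool are all illustrative and not taken from the talk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Leaf:
    """Holds one shard of stories, keyed by object_id (hypothetical layout)."""

    def __init__(self, stories):
        self.stories = stories  # object_id -> precomputed score

    def top_k(self, user_id, k):
        # A real leaf would score stories for this user_id; scores are static here.
        return heapq.nlargest(k, ((s, oid) for oid, s in self.stories.items()))


class Aggregator:
    """Fans a request out to all leaves and merges their partial top-k lists."""

    def __init__(self, leaves):
        self.leaves = leaves

    def top_k(self, user_id, k):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = list(pool.map(lambda leaf: leaf.top_k(user_id, k), self.leaves))
        merged = heapq.nlargest(k, (item for part in partials for item in part))
        return [oid for _, oid in merged]


leaves = [Leaf({1: 0.9, 2: 0.1}), Leaf({3: 0.5, 4: 0.7}), Leaf({5: 0.3})]
feed = Aggregator(leaves).top_k(user_id=42, k=3)
```

Each leaf only needs to return k items because no story outside a leaf's local top-k can appear in the global top-k.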

Other ML systems

● Answer ranking

● Answer collapsing

● User expertise

● Topics to follow

● Users to follow

● Spam content detection

● Malicious user / troll detection

● Trending topics

● …

Duplicate Questions

• Duplicate questions are a problem because:

• good writers don't want to answer the same question over and over

• they result in poor user experience

• they dilute the quality of the pages and relevance

• As we grow faster, we will have more duplicate questions.

Duplicate Questions: Product Challenges

• How to handle duplicate questions?

• Maybe delete them? Or merge them?

• Which one should be the canonical question?

• Who should decide if the questions are duplicate?

• Should we trust the system?

• Should we empower users?

Duplicate Questions: Solution

● Based on the classifier, merge the new & existing duplicate questions.

● Users can also (un)merge questions.

● Review every request.

● Merging questions.

Duplicate Questions: Technical Challenges

• The classifier must have high precision.

• The merge operation must be very quick.

• A question merge affects most parts of the product.

• Conflicts in the (Un)Merge operations

• Merge Cycles

• Merge Chains

• Conflicting Merge & Unmerge
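Merge chains and merge cycles can be illustrated with a redirect map from each merged question to the question it was merged into. In this sketch, following redirects resolves a chain to its canonical question, and a merge is rejected when both questions already resolve to the same canonical one, since adding that redirect would close a cycle. The class `MergeGraph` and its methods are hypothetical names; the slides do not show how Quora stores merges.

```python
class MergeGraph:
    """Tracks question merges as redirects (illustrative only)."""

    def __init__(self):
        self.redirect = {}  # merged question id -> question it was merged into

    def canonical(self, qid):
        """Follow the merge chain until reaching a question with no redirect."""
        while qid in self.redirect:
            qid = self.redirect[qid]
        return qid

    def merge(self, source, target):
        """Merge source into target, refusing merges that would form a cycle."""
        src, dst = self.canonical(source), self.canonical(target)
        if src == dst:
            raise ValueError("merge would create a cycle")
        self.redirect[src] = dst

    def unmerge(self, qid):
        """Remove the redirect of qid so it becomes canonical again."""
        self.redirect.pop(qid, None)


graph = MergeGraph()
graph.merge("a", "b")
graph.merge("b", "c")  # chain: a -> b -> c
chain_end = graph.canonical("a")
```

Redirects are only ever added from one canonical question to a different canonical question, so the map stays acyclic and `canonical` always terminates.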

Duplicate Questions: Classifier

• Features for Question Pair:

• TF-IDF

• Word2vec

• Topic

• Binary classifier trained with labelled data

• Boosted decision tree

• Monitor the precision

• Periodically retrain the model to retain the high precision.
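As a sketch of one pair feature, the TF-IDF vectors of two questions can be compared with cosine similarity; that similarity would then be one input to the boosted decision tree alongside word2vec and topic features. The tokenization, the smoothed IDF formula, and the function names below are assumptions for illustration, and the classifier itself is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(questions):
    """Build sparse TF-IDF vectors (term -> weight) for a list of questions."""
    docs = [q.lower().split() for q in questions]
    n = len(docs)
    df = Counter(term for d in docs for term in set(d))
    # Smoothed IDF, chosen for this example; other variants are equally valid.
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical questions: the first two are likely duplicates.
questions = [
    "how do i learn python",
    "how can i learn python quickly",
    "why is the sky blue",
]
vecs = tfidf_vectors(questions)
sim_duplicate = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```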

Duplicate Questions: Question Merging

• Make the content move extremely quickly.

• Faster merges mean fewer race conditions.

• Execute tasks in parallel using multiple workers.

• Finite State Machine (FSM) for state tracking.

• Lock the critical section.

• Identify the conflicts & fail early.
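The combination of state tracking, a locked critical section, and failing early on conflicts can be sketched as follows. The states, the transition table, and the classes `MergeRequest` and `MergeCoordinator` are hypothetical; the talk names the techniques but not the actual states or data structures.

```python
import threading
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Allowed FSM transitions (an assumption made for this example).
ALLOWED = {
    State.PENDING: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.DONE, State.FAILED},
    State.DONE: set(),
    State.FAILED: set(),
}


class MergeRequest:
    def __init__(self, source, target):
        self.qids = frozenset((source, target))
        self.state = State.PENDING

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class MergeCoordinator:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()  # question ids touched by running requests

    def start(self, request):
        """Fail early if another running request touches the same questions."""
        with self._lock:  # critical section: check and claim atomically
            if self._in_flight & request.qids:
                request.transition(State.FAILED)
                return False
            self._in_flight |= request.qids
            request.transition(State.RUNNING)
            return True

    def finish(self, request):
        with self._lock:
            request.transition(State.DONE)
            self._in_flight -= request.qids
```

Only the conflict check and the claim sit inside the lock; the content move itself would run outside it, which keeps the critical section short.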

Duplicate Questions: Solution

● Track state of the request.

● Review critical requests first.

● Block all the conflicts while processing the request.

Questions?