Search, Discovery and the Question namespace
Nikhil Dandekar (@nikhilbd), Sandeep Goyal
5/10/2016
About us
Nikhil Dandekar
● Engineering Manager: Search, Questions and Growth at Quora
● Previously: Foursquare, Bing
● Background in Search, Data, Machine Learning
Sandeep Goyal
● Software Engineer: Search, Questions and Growth at Quora
● Previously: Playdom, Yahoo
● Background in Growth, Systems
Quora’s Mission
“To share and grow the world’s knowledge”
● Millions of questions & answers
● Millions of users
● Over a million topics
● ...
Agenda
● The core data
● Search & the question namespace
● Discovery - Feed ranking
● Duplicate Questions deep dive
The Core Data
Lots of data relations
Complex network propagation effects
Importance of topics & semantics
Search & the Question namespace
How we think of search
Ranking - Search ranking
● Match user queries to Quora entities
● Corpus: Quora questions, answers,
topics, users, blogs etc.
● Ranking: Traditional IR scores (e.g.
BM25), hand-tuned or ML-ranking
● Focus on long-term satisfaction
○ If a question exists, but the answer
is unsatisfactory, let the user
“Request Answers” for the question
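As an illustration of the traditional IR scoring mentioned above, here is a minimal sketch of Okapi BM25 over tokenized question titles. The parameter values `k1=1.5` and `b=0.75` are common defaults, not values from the talk, and the sample questions are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up corpus of question titles.
questions = [
    "how do i learn machine learning".split(),
    "what is the best way to learn python".split(),
    "why is the sky blue".split(),
]
scores = bm25_scores("learn python".split(), questions)
best = max(range(len(questions)), key=scores.__getitem__)
```

In a hand-tuned or ML ranker, a score like this would typically be one input among several rather than the final ranking.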
Question Asking
Goal: Find the best people to answer a
question
● Understand the question
● Find people who can best answer the
question
● “Ask to Answer”: Route the question
to these people
● Either manual or automated A2A
Related Questions
• Given interest in question A (source), what other questions will be interesting?
• Not only about similarity, but also “interestingness”
• Features such as:
○ Textual
○ Co-visit
○ Topics
○ …
• Important for logged-out use case
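The three feature families above (textual, co-visit, topics) could be combined roughly as follows. This is only a sketch: the weights, the Jaccard similarity, the `covisits` table and the sample questions are all hypothetical, and a real system would more likely learn the combination than hand-pick it.

```python
def jaccard(a, b):
    """Overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def related_score(source, candidate, covisits, weights=(0.4, 0.4, 0.2)):
    """Weighted mix of textual, co-visit and topic signals (hypothetical weights)."""
    w_text, w_covisit, w_topic = weights
    text_sim = jaccard(source["text"].lower().split(),
                       candidate["text"].lower().split())
    topic_sim = jaccard(source["topics"], candidate["topics"])
    # Assumed: fraction of sessions that visited both questions, in [0, 1].
    covisit = covisits.get(frozenset((source["id"], candidate["id"])), 0.0)
    return w_text * text_sim + w_covisit * covisit + w_topic * topic_sim

source = {"id": 1, "text": "How do I learn Python?", "topics": {"Python", "Learning"}}
candidates = [
    {"id": 2, "text": "What is the best way to learn Python?", "topics": {"Python"}},
    {"id": 3, "text": "How do I learn to cook?", "topics": {"Cooking", "Learning"}},
]
covisits = {frozenset((1, 2)): 0.6, frozenset((1, 3)): 0.05}

ranked = sorted(candidates,
                key=lambda c: related_score(source, c, covisits),
                reverse=True)
```

Here question 3 shares more words with the source, but question 2 ranks first because of its co-visit and topic signals, which matches the point that relatedness is not only about textual similarity.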
Duplicate Questions
• Important issue for Quora
• Want to make sure we don’t disperse knowledge about the same question across multiple pages
• Solution: binary classifier trained with labelled data
• Features
○ Textual vector space models
○ Usage-based features
○ etc.
• More on this later...
Discovery - Feed ranking
Ranking - Feed
• Goal: Present the most interesting stories for a user at a given time
• Interesting = topical relevance + social relevance + timeliness
• Stories = questions + answers
• Relevance-ordered feed gives big gains in engagement over time-ordered feed
• Challenges:
○ potentially many candidate stories
○ real-time ranking
○ optimize for relevance
• Use Machine Learning for feed ranking
Feed dataset: impression logs
[Figure: impression log of feed stories, each labelled with the actions the user took on it: click, upvote, downvote, expand, share, answer, pass, follow]
What is relevance?
● Value of showing a story to a user, e.g. weighted sum of actions:
$v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$
● Goal: predict this value for new stories. 2 possible approaches:
○ predict value directly
$v_{\text{pred}} = f(x)$
■ pros: single regression model
■ cons: can be ambiguous, coupled
○ predict probabilities for each action, then compute expected value:
$v_{\text{pred}} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$
■ pros: better use of supervised signal, decouples action models from action values
■ cons: more costly, one classifier per action
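The two value formulas can be sketched in a few lines. The action values in `ACTION_VALUES` and the probabilities below are invented for illustration; in the second approach each probability would come from its own per-action classifier.

```python
# Hypothetical per-action values v_a; the talk does not give actual numbers.
ACTION_VALUES = {"click": 0.2, "upvote": 1.0, "share": 2.0, "downvote": -3.0}

def observed_value(actions):
    """v = sum_a v_a * 1{y_a = 1}: add up values of actions that occurred."""
    return sum(ACTION_VALUES[a] for a in set(actions))

def expected_value(action_probs):
    """v_pred = sum_a v_a * p(a | x), given one predicted probability per action."""
    return sum(ACTION_VALUES[a] * p for a, p in action_probs.items())

# Training label for a logged impression where the user clicked and upvoted.
label = observed_value(["click", "upvote"])

# Ranking two candidate stories by expected value (made-up probabilities).
story_probs = {
    "story_a": {"click": 0.30, "upvote": 0.05, "share": 0.01, "downvote": 0.02},
    "story_b": {"click": 0.10, "upvote": 0.10, "share": 0.02, "downvote": 0.00},
}
ranking = sorted(story_probs,
                 key=lambda s: expected_value(story_probs[s]),
                 reverse=True)
```

Because the values live in a separate table, they can be retuned without retraining the per-action models, which is the decoupling listed among the pros.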
Feature engineering
● Essential for getting good ranking
● Better if updated in real-time (more reactive)
● Main sets of features:
○ user (e.g. age, country, recent activity)
○ story (e.g. popularity, trendiness, quality)
○ interactions between the two (e.g. topic or author affinity)
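A minimal sketch of how the three feature groups (user, story, user-story interaction) might be assembled for one candidate story. Every field name here is hypothetical, and the affinity definitions are simple stand-ins rather than the features used in production.

```python
def build_features(user, story):
    """Assemble user, story and interaction features for one (user, story) pair."""
    shared_topics = set(user["followed_topics"]) & set(story["topics"])
    return {
        # User features
        "user_country": user["country"],
        "user_recent_actions": user["recent_actions"],
        # Story features
        "story_upvotes": story["upvotes"],
        "story_age_hours": story["age_hours"],
        # Interaction features: share of the story's topics the user follows,
        # and whether the user follows the story's author.
        "topic_affinity": (len(shared_topics) / len(story["topics"])
                           if story["topics"] else 0.0),
        "follows_author": float(story["author_id"] in user["followed_users"]),
    }

user = {"country": "IN", "recent_actions": 12,
        "followed_topics": {"Python", "Search"}, "followed_users": {42}}
story = {"topics": ["Python", "Machine Learning"], "upvotes": 120,
         "age_hours": 3.0, "author_id": 42}
features = build_features(user, story)
```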
Models
● Linear
○ simple, fast to train
○ manual, non-linear transforms for richer representation (buckets, ngrams)
● Decision trees
○ learn non-linear representations
● Tree ensembles
○ Random forests
○ Gradient boosted decision trees
● In-house C++ training code, third-party libraries for prototyping new models
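The manual transforms listed for linear models (buckets, ngrams) can be sketched as below. The bucket boundaries and the choice of character trigrams are assumptions for illustration; the talk does not say which variants are used.

```python
import bisect

def bucketize(value, boundaries):
    """One-hot encode a numeric value by the bucket it falls into.

    With sorted boundaries [b0, b1, ...] the buckets are
    (-inf, b0), [b0, b1), ..., [b_last, +inf).
    """
    idx = bisect.bisect_right(boundaries, value)
    return [1.0 if i == idx else 0.0 for i in range(len(boundaries) + 1)]

def char_ngrams(text, n=3):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def linear_score(features, weights):
    """Plain dot product, as a linear model would compute it."""
    return sum(f * w for f, w in zip(features, weights))

# A linear model cannot express "7 upvotes is much better than 0 but about
# the same as 9" on the raw count; one weight per bucket can.
upvote_buckets = bucketize(7, [1, 10, 100])
score = linear_score(upvote_buckets, [-0.5, 0.3, 0.8, 1.0])  # made-up weights
trigrams = char_ngrams("quora")
```

Decision trees and tree ensembles learn such splits themselves, which is why the slide lists them as learning non-linear representations.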
Scalability: feed backend system
[Diagram: requests from Web (python) reach a tier of Aggregators (Aggregator 1, Aggregator 2, Aggregator 3, ...) backed by a tier of Leaves (Leaf 1, Leaf 2, Leaf 3, ...); diagram labels: user_id, object_id]
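One common way to organize an aggregator/leaf backend is scatter-gather: the aggregator fans a request out to every leaf, each leaf returns its local top-k, and the aggregator merges them. The sketch below assumes that pattern and assumes each leaf holds a shard of candidate stories keyed by `object_id`; the talk does not confirm either detail, and the classes are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class Leaf:
    """Holds one shard of candidate stories (assumed sharding by object_id)."""
    def __init__(self, scored_stories):
        self.scored_stories = scored_stories  # {object_id: score}

    def top_k(self, user_id, k):
        # A real leaf would score stories for this user_id; scores are fixed here.
        return heapq.nlargest(
            k, ((score, oid) for oid, score in self.scored_stories.items()))

class Aggregator:
    """Fans a request out to all leaves and merges their partial results."""
    def __init__(self, leaves):
        self.leaves = leaves

    def top_k(self, user_id, k):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = list(pool.map(lambda leaf: leaf.top_k(user_id, k),
                                     self.leaves))
        merged = heapq.nlargest(k, (item for part in partials for item in part))
        return [oid for _, oid in merged]

aggregator = Aggregator([
    Leaf({"q1": 0.9, "q2": 0.1}),
    Leaf({"q3": 0.5, "q4": 0.7}),
    Leaf({"q5": 0.3}),
])
top = aggregator.top_k(user_id=123, k=3)
```

Each leaf only needs to return k items, so the aggregator merges at most k times the number of leaves, regardless of shard size.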
Other ML systems
● Answer ranking
● Answer collapsing
● User expertise
● Topics to follow
● Users to follow
● Spam content detection
● Malicious user / troll detection
● Trending topics
● ...
Duplicate Questions
• Duplicate questions are a problem because:
○ good writers don't want to answer the same question over and over
○ they result in poor user experience
○ they dilute the quality of the pages and relevance
• As we grow faster, we will have more duplicate questions.
Duplicate Questions: Product Challenges
• How to handle duplicate questions?
○ Maybe delete them? Or merge them?
○ Which one should be the canonical question?
• Who should decide if the questions are duplicate?
○ Should we trust the system?
○ Should we empower users?
Duplicate Questions: Solution
● Based on the classifier, merge the new
& existing duplicate questions.
● Users can also (un)merge questions.
● Review every request.
● Merging questions.
Duplicate Questions: Technical Challenges
• The classifier must have high precision.
• The merge operation must be very quick.
• Question merges affect most parts of the product.
• Conflicts in the (un)merge operations:
○ Merge cycles
○ Merge chains
○ Conflicting merge & unmerge
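Merge chains and merge cycles can be illustrated with a redirect map from each merged question to the question it was merged into. This is a sketch under that assumed data model; `MergeGraph` and `MergeConflict` are hypothetical names, not Quora's implementation.

```python
class MergeConflict(Exception):
    """Raised when a merge or unmerge request cannot be applied."""

class MergeGraph:
    def __init__(self):
        # merged question id -> question id it was merged into
        self.redirect = {}

    def canonical(self, qid):
        """Follow the merge chain (A -> B -> C) to the canonical question."""
        while qid in self.redirect:
            qid = self.redirect[qid]
        return qid

    def merge(self, source, target):
        src, dst = self.canonical(source), self.canonical(target)
        if src == dst:
            # Both already resolve to the same question; adding a redirect
            # here would create a cycle.
            raise MergeConflict(f"{source} and {target} are already merged")
        self.redirect[src] = dst

    def unmerge(self, qid):
        if qid not in self.redirect:
            raise MergeConflict(f"{qid} is not merged into anything")
        del self.redirect[qid]

graph = MergeGraph()
graph.merge("A", "B")
graph.merge("B", "C")          # chain: A -> B -> C
chain_target = graph.canonical("A")

try:
    graph.merge("C", "A")      # would close the cycle C -> ... -> C
    cycle_rejected = False
except MergeConflict:
    cycle_rejected = True
```

Resolving both sides to their canonical question before linking them keeps the redirect map acyclic, so `canonical` always terminates.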
Duplicate Questions: Classifier
• Features for question pair:
○ TF-IDF
○ Word2vec
○ Topic
• Binary classifier trained with labelled data
○ Boosted decision tree
• Monitor the precision
• Periodically retrain the model to retain the high precision.
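As an example of a question-pair feature, the sketch below computes TF-IDF cosine similarity between two questions. The smoothed IDF formula and the whitespace tokenization are assumptions; the resulting number would be one input to the boosted decision tree alongside the Word2vec and topic features, not a duplicate decision on its own.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for d in docs for term in set(d))
    # Smoothed IDF so that terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up questions; the first two are intended as near-duplicates.
corpus = [
    "how can i learn python quickly".split(),
    "how do i learn python fast".split(),
    "why is the sky blue".split(),
]
q1, q2, q3 = tfidf_vectors(corpus)
sim_duplicate = cosine(q1, q2)
sim_unrelated = cosine(q1, q3)
```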
Duplicate Questions: Question Merging
• Make the content move extremely quickly.
• Faster merges mean fewer race conditions.
• Execute tasks in parallel using multiple workers.
• Finite State Machine (FSM) for state tracking.
• Lock the critical section.
• Identify the conflicts & fail early.
Duplicate Questions: Solution
● Track state of the request.
● Review critical requests first.
● Block all the conflicts while
processing the request.
Questions?