Lei Yang, Senior Engineering Manager, Quora at MLconf NYC - 4/15/16

Sharing and growing the world's knowledge with machine learning

Lei Yang (leiyang@quora.com)

April 2016

Our mission

“To share and grow the world’s knowledge”

● Millions of questions & answers

● Millions of users

● Thousands of topics

● ...

What we care about

● Demand

● Quality

● Relevance

Data@Quora

[Diagram: Topic, Question, User, Answer, Actions]

Lots of data relations

Complex network propagation effects

Importance of topics & semantics

Machine Learning@Quora

Ranking - Answer ranking

What is a good Quora answer?

● Truthful

● Reusable

● Provides explanation

● Well formatted

...

Ranking - Answer ranking

How are those criteria translated into features?

● Features that relate to the text quality itself

● Interaction features (upvotes/downvotes, clicks, comments…)

● User features (e.g. expertise in topic)

Ranking - Feed

Present the most interesting stories for a user at a given time

● Interesting = topical relevance + social relevance + timeliness

● Stories = questions + answers

● Personalized learning-to-rank approach

● Relevance-ordered vs time-ordered = big gains in engagement

● Challenges

○ Potentially many candidate stories

○ Real-time ranking

○ Objective function
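The "interesting = topical relevance + social relevance + timeliness" idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `Story` fields, the fixed weights, and the exponential half-life. The slides describe a personalized learning-to-rank model, so a hand-weighted sum only stands in for whatever the learned model computes.

```python
from dataclasses import dataclass


@dataclass
class Story:
    story_id: str
    topical_relevance: float  # assumed precomputed, in [0, 1]
    social_relevance: float   # assumed precomputed, in [0, 1]
    age_hours: float


def timeliness(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for a new story, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def interest_score(story: Story, w_topic: float = 0.5,
                   w_social: float = 0.3, w_time: float = 0.2) -> float:
    """Hypothetical weighted sum of the three components named on the slide."""
    return (w_topic * story.topical_relevance
            + w_social * story.social_relevance
            + w_time * timeliness(story.age_hours))


def rank_feed(stories):
    """Relevance-ordered feed: highest interest score first."""
    return sorted(stories, key=interest_score, reverse=True)
```

Sorting by `age_hours` alone would give the time-ordered baseline that the slide compares against.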

Ranking - Feed

● Personalized LTR model

● Features

○ Quality of question/answer

○ Topics the user is interested in or knows about

○ Users the user is following

○ What is trending/popular

○ ...

● Different temporal windows

● Multi-stage solution with different “streams”
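One way to read "multi-stage solution with different streams" is a cheap shortlist per candidate stream followed by a more expensive final ranking. The sketch below assumes that reading; the stream names, the two scoring functions, and the cut-off sizes are hypothetical and not taken from the talk.

```python
import heapq
from typing import Callable, Dict, Iterable, List


def multi_stage_rank(streams: Dict[str, Iterable[str]],
                     cheap_score: Callable[[str], float],
                     full_score: Callable[[str], float],
                     per_stream_k: int = 2,
                     final_k: int = 3) -> List[str]:
    """Stage 1: keep the top candidates of each stream using a cheap score.
    Stage 2: rank the merged shortlist with the expensive score."""
    shortlist = set()
    for candidates in streams.values():
        shortlist.update(heapq.nlargest(per_stream_k, candidates, key=cheap_score))
    return heapq.nlargest(final_k, shortlist, key=full_score)
```

The point of the design is that the expensive scorer only sees the shortlist, which keeps real-time ranking feasible when there are many candidate stories.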

Recommendations - Topics

Recommend new topics for the user to follow, based on:

● Topics you already follow

● Users you already follow

● Interactions with questions/answers

● Topic-related features

● ...

Recommendations - Users

Recommend new users for the user to follow, based on:

● Users you already follow

● Topics you already follow

● Interactions with users

● User-related features

● ...

Related questions

Given interest in a question, what other questions are interesting?

● Not only about similarity, but also “interestingness”

● Features such as:

○ Textual

○ Co-visit

○ Topics

○ …

● Important for logged-out use case
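As an illustration of the co-visit feature, the sketch below counts how often two questions are viewed in the same session and uses those counts to find related questions. The session format and function names are assumptions for the example; a real system would combine this signal with the textual and topic features listed above.

```python
from collections import Counter
from itertools import combinations


def co_visit_counts(sessions):
    """Count, for each unordered question pair, the sessions that visited both."""
    counts = Counter()
    for visited in sessions:
        for a, b in combinations(sorted(set(visited)), 2):
            counts[(a, b)] += 1
    return counts


def related_by_co_visit(question, counts, k=3):
    """Return up to k questions most often co-visited with `question`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == question:
            scores[b] += n
        elif b == question:
            scores[a] += n
    return [q for q, _ in scores.most_common(k)]
```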

Duplicate questions

● Important issue for Quora

○ Want to avoid dispersing knowledge across multiple copies of the same question

● Binary classifier trained with labelled data

● Features

○ Textual vector space models

○ Usage-based features

○ ...
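A simple example of a textual vector space feature is the cosine similarity between bag-of-words vectors of two questions. This sketch is only an assumed illustration of that feature type: the tokenizer is deliberately naive, and the resulting value would be one input among many to the binary classifier, not the classifier itself.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words counts (naive tokenizer, for illustration only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```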

User expertise inference

Infer user’s trustworthiness in relation to a given topic

● We take into account:

○ Answers written on topic

○ Upvotes/downvotes received

○ Endorsements

○ ...

● Trust/expertise propagates through the network

● Useful as input/features in other models
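The talk does not say how trust propagates through the network. As one possible illustration, the sketch below uses a PageRank-style iteration over an endorsement graph for a single topic: each user keeps part of a base score (for example, derived from answers and votes) and receives a share of the trust of users who endorse them. The graph format, damping factor, and iteration count are all assumptions.

```python
def propagate_trust(endorsements, base_trust, damping=0.5, iterations=20):
    """endorsements maps an endorser to the list of users they endorse.
    base_trust maps every user to a per-topic base score."""
    users = set(base_trust)
    trust = dict(base_trust)
    for _ in range(iterations):
        updated = {}
        for u in users:
            # Each endorser splits their current trust evenly among endorsees.
            incoming = sum(trust[v] / len(endorsements[v])
                           for v in users
                           if u in endorsements.get(v, ()))
            updated[u] = (1 - damping) * base_trust[u] + damping * incoming
        trust = updated
    return trust
```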

Spam detection and moderation

● Very important for Quora to maintain content quality

● Pure manual approaches do not scale

● Hard to get algorithms 100% right

● ML algorithms detect content/user issues

○ Output of the algorithms feeds manually curated moderation queues
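The hand-off from model output to human review could look like the following sketch, in which items scoring above a review threshold enter a priority queue and moderators see the highest-scoring items first. The class name, threshold, and scoring scale are hypothetical; the talk only states that algorithm output feeds the moderation queues.

```python
import heapq


class ModerationQueue:
    """Priority queue of flagged content, highest model score first."""

    def __init__(self, review_threshold=0.5):
        self.review_threshold = review_threshold
        self._heap = []

    def submit(self, content_id, spam_score):
        """Queue the item for human review if the model score is high enough."""
        if spam_score < self.review_threshold:
            return False
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-spam_score, content_id))
        return True

    def next_for_review(self):
        """Return the most suspicious queued item, or None if the queue is empty."""
        if not self._heap:
            return None
        _, content_id = heapq.heappop(self._heap)
        return content_id
```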

Content creation prediction

● Quora’s algorithms optimize for more than the probability of reading

● Important to predict the probability of a user answering a question

● Some product features rely completely on that prediction

○ E.g. A2A (ask to answer) suggestions
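To make the A2A use case concrete, the sketch below scores candidate users with a logistic model of the probability of answering and suggests the top few. The feature names, weights, and the choice of a plain logistic model are assumptions for illustration; the talk does not specify which model or features drive the prediction.

```python
import math


def answer_probability(features, weights, bias=0.0):
    """Logistic model: P(user answers question) from hypothetical features."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def suggest_a2a(candidates, weights, k=2):
    """Return the k candidate users most likely to answer."""
    ranked = sorted(candidates.items(),
                    key=lambda item: answer_probability(item[1], weights),
                    reverse=True)
    return [user for user, _ in ranked[:k]]
```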

Trending topics

Highlight current events that are interesting to the user

● We take into account:

○ Global “Trendiness”

○ Social “Trendiness”

○ User’s interest

○ ...

● Trending topics are a great discovery mechanism

Models & Experimentation

Models

● Logistic Regression

● Elastic Nets

● Gradient Boosted Decision Trees

● Random Forests

● (Deep) Neural Networks

● LambdaMART

● Matrix Factorization

● LDA

● ...

Open source project -- QMF

Quora Matrix Factorization

https://github.com/quora/qmf

● Currently BPR and WALS

● Multithreaded implementation in C++14
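For readers unfamiliar with BPR (Bayesian Personalized Ranking), the sketch below shows a single stochastic gradient step on one (user, preferred item, other item) triple. It is a plain-Python illustration of the general technique with assumed learning rate and regularization values; it is not QMF's code and does not reflect QMF's API.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_step(p_u, q_i, q_j, lr=0.1, reg=0.01):
    """One SGD ascent step on ln(sigmoid(x_ui - x_uj)) with L2 regularization.
    p_u: user factors; q_i: factors of the preferred item; q_j: factors of the
    other item. All three lists are updated in place."""
    x_uij = dot(p_u, q_i) - dot(p_u, q_j)
    g = 1.0 / (1.0 + math.exp(x_uij))  # equals sigmoid(-x_uij)
    for f in range(len(p_u)):
        pu, qi, qj = p_u[f], q_i[f], q_j[f]
        p_u[f] += lr * (g * (qi - qj) - reg * pu)
        q_i[f] += lr * (g * pu - reg * qi)
        q_j[f] += lr * (-g * pu - reg * qj)
    return x_uij
```

Repeating the step on sampled triples pushes the score of the preferred item above the score of the other item for that user.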

ML platform

● Allow ML Engineers and Data Scientists to collaborate within the same ML framework

● Easy integration with well known tools and open source libraries

● Offline evaluation and debugging

● User friendly Python frontend

● High performance and scalable C++/CUDA backend

[Diagram: Redshift, MySQL, S3, Python User Interface, Trainer Box, Session, CPU, GPU, Disk, WALS, BPR, ...]

Experimentation

● Extensive A/B testing, data-driven decision-making

● Separate, orthogonal “layers” for different parts of the system

● Experiment framework showing comparisons for various metrics
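A common way to get separate, orthogonal experiment layers is to hash the user ID together with a layer name, so that bucket assignment in one layer is independent of assignment in another. The sketch below assumes that approach; the layer names, bucket count, and hash choice are illustrative and not taken from the talk.

```python
import hashlib


def assign_bucket(user_id, layer, num_buckets=100):
    """Deterministic bucket for a user within one experiment layer."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def in_treatment(user_id, layer, treatment_buckets):
    """True if the user's bucket in this layer belongs to the treatment group."""
    return assign_bucket(user_id, layer) in treatment_buckets
```

Because the layer name is part of the hash input, a user's bucket in a hypothetical `"feed_ranking"` layer says nothing about their bucket in a `"related_questions"` layer, so experiments in different parts of the system can run at the same time.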

Conclusions

● At Quora we have not only Big, but also “rich” data

● Our algorithms need to understand and optimize complex aspects such as quality, interestingness, relevance, or user expertise

● We believe ML will be one of the keys to our success

● We have many interesting problems, and many unsolved challenges

We are hiring! www.quora.com/careers