Extracting keywords from text - MIT CSAIL

Extracting keywords from text

Lavanya Sharan, Insight Data Science Fellow

with

matching ads to content

keywords

matching ad

?

A machine learning model for keyword extraction

Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
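The first stage of the pipeline above, turning content into candidate keywords, might be sketched as follows. This is a minimal illustration and not the author's implementation: the function name `candidate_keywords`, the tiny stopword list, and the choice of 1- to 3-word phrases are all assumptions made for the example, and it uses only the standard library.

```python
import re

# Small illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "with"}

def candidate_keywords(text, max_words=3):
    """Return all 1- to max_words-word phrases that neither start nor end with a stopword."""
    candidates = set()
    # Split on punctuation so that phrases never cross a clause or sentence boundary.
    for sentence in re.split(r"[.!?;:,]", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0].lower() in STOPWORDS or gram[-1].lower() in STOPWORDS:
                    continue
                candidates.add(" ".join(gram))
    return candidates

cands = candidate_keywords("The bridge to Brooklyn is closed. Brooklyn residents are upset.")
```

Every candidate produced this way is then scored by the later stages, so this step can afford to be generous.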

Brooklyn

Term length, Wikipedia frequency, TF-IDF score, ...
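As a rough illustration of the feature extraction step, the sketch below computes the three features named above for a single-word candidate. Several things here are assumptions: term length is measured in characters, the TF-IDF uses one common smoothed formula, and `wiki_freq` is a hypothetical lookup table standing in for whatever Wikipedia statistics the original system used.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a single-word term: relative term frequency times smoothed inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1  # smoothed variant (assumption)
    return tf * idf

def extract_features(term, doc_tokens, corpus, wiki_freq):
    """Build the feature dictionary for one candidate keyword."""
    return {
        "term_length": len(term),                    # length in characters (assumption)
        "wikipedia_freq": wiki_freq.get(term, 0.0),  # hypothetical lookup table
        "tf_idf": tf_idf(term, doc_tokens, corpus),
    }

# Toy corpus of three tokenized documents.
corpus = [
    ["brooklyn", "bridge", "closed", "brooklyn"],
    ["city", "council", "vote"],
    ["bridge", "repairs", "delayed"],
]
wiki_freq = {"brooklyn": 0.002}  # made-up value for illustration
features = extract_features("brooklyn", corpus[0], corpus, wiki_freq)
```

In this toy corpus "brooklyn" scores higher than "bridge" in the first document, because it occurs more often there and appears in fewer documents overall.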

Logistic regression

Crowd500: 500 news articles

Human-annotated keywords, 9:1 training-test split
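A 9:1 split of 500 articles could be produced along these lines. The slides do not say how the split was made; splitting at the article level (so that candidates from one article never land on both sides) and the fixed random seed are assumptions of this sketch, and `split_articles` is a hypothetical helper.

```python
import random

def split_articles(article_ids, test_fraction=0.1, seed=0):
    """Shuffle article ids reproducibly and hold out test_fraction of them for testing."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (training ids, test ids)

train_ids, test_ids = split_articles(range(500))
```

With 500 articles this yields 450 training and 50 test articles.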

P(keyword) = 0.81
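A probability such as P(keyword) comes from passing a weighted sum of the candidate's features through the logistic function. The sketch below shows that computation only; the weights, bias, and feature values are made up for illustration and are not the fitted model from the slides.

```python
import math

def p_keyword(features, weights, bias):
    """Logistic regression prediction: sigmoid of bias plus the weighted feature sum."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and feature values, chosen only to make the example concrete.
weights = {"term_length": 0.05, "wikipedia_freq": -50.0, "tf_idf": 2.0}
bias = -1.0
features = {"term_length": 8, "wikipedia_freq": 0.002, "tf_idf": 0.85}

prob = p_keyword(features, weights, bias)
is_keyword = prob >= 0.5  # the 0.5 threshold is a tunable assumption
```

Raising or lowering the threshold trades precision against recall when deciding which candidates to keep.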

nltk

State-of-the-art performance on Crowd500

About me

glass

Keywords: Lavanya, image recognition, foodie

Keyword classifier

“Brooklyn”

Term frequency, TF-IDF score, Wikipedia frequency

Term length, capitalized?

Position in page, spread in page

Named entity? Noun phrase? N-gram?
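Some of the simpler features listed above can be computed directly from token positions. The definitions used here are assumptions, since the slides give only the feature names: "position" is taken as the relative offset of the first occurrence, and "spread" as the distance between the first and last occurrence divided by page length. `heuristic_features` is a hypothetical helper for single-word terms.

```python
def heuristic_features(term, tokens):
    """Term frequency, capitalization, position, and spread of a single-word term in a tokenized page."""
    positions = [i for i, tok in enumerate(tokens) if tok.lower() == term.lower()]
    if not positions:
        return None  # the term does not occur on the page
    n = len(tokens)
    return {
        "term_frequency": len(positions),
        "capitalized": any(tokens[i][0].isupper() for i in positions),
        "position": positions[0] / n,                   # 0.0 means the term opens the page
        "spread": (positions[-1] - positions[0]) / n,   # 0.0 means a single occurrence
    }

tokens = "Brooklyn residents say the bridge to Brooklyn is closed".split()
feats = heuristic_features("brooklyn", tokens)
```

A term that appears early and recurs throughout the page gets a low position and a high spread.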

Logistic regression model

In-sample: 65%, out-of-sample: 65%, chance: 50%
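Assuming the figures above are classification accuracies, the following sketch shows how a logistic regression model can be fitted and scored in-sample. It uses plain batch gradient descent on a toy, linearly separable data set with two made-up features, so its result says nothing about the numbers on the slide; `train_logistic` and `accuracy` are hypothetical helpers.

```python
import math

def train_logistic(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the average log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad_w[j] += (p - yi) * xj
            grad_b += p - yi
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of examples whose predicted label (score >= 0) matches the true label."""
    correct = 0
    for xi, yi in zip(X, y):
        predicted = b + sum(wj * xj for wj, xj in zip(w, xi)) >= 0
        correct += int(predicted == bool(yi))
    return correct / len(y)

# Toy examples: [tf_idf, capitalized]; label 1 marks a keyword.
X = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 1], [0.3, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
in_sample_accuracy = accuracy(X, y, w, b)
```

Running `accuracy` on held-out examples instead of the training set gives the out-of-sample figure; close in-sample and out-of-sample values suggest the model is not overfitting.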

Simple heuristics beat complex features

All stages of model required for performance

Beats AlchemyAPI for a range of parameters
