Extracting keywords from text - MIT CSAIL · Extracting keywords from text Lavanya Sharan Insight Data Science Fellow with. matching ads to content. keywords. keywords matching ad

Extracting keywords from text

Lavanya Sharan Insight Data Science Fellow

with

matching ads to content

keywords

matching ad

?

A machine learning model for keyword extraction

Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
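
The candidate-keyword stage of this pipeline could be sketched as follows. This is a minimal illustration, not the method used in the talk: the slides do not say how candidates were generated, so the n-gram approach, the `candidate_keywords` function, and the tiny `STOPWORDS` set are all assumptions.

```python
import re

# Hypothetical stopword list for illustration; the slides do not name one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "for", "with"}

def candidate_keywords(text, max_n=3):
    """Return unique 1..max_n-grams that neither start nor end with a stopword."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    seen, candidates = set(), []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip phrases such as "in Brooklyn" or "pizza in".
            if gram[0].lower() in STOPWORDS or gram[-1].lower() in STOPWORDS:
                continue
            phrase = " ".join(gram)
            if phrase.lower() not in seen:
                seen.add(phrase.lower())
                candidates.append(phrase)
    return candidates

cands = candidate_keywords("The best pizza in Brooklyn")
```

Each candidate would then be passed on to feature extraction and scored by the classifier.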

Dataset: Crowd500
500 news articles
Human-annotated keywords
9:1 training-test split
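
A 9:1 training-test split over 500 articles might look like the sketch below. The slides do not say whether the split was made per article or per candidate keyword; splitting by article, the `train_test_split` helper, and the fixed seed are assumptions made here for illustration.

```python
import random

def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical article identifiers standing in for the 500 news articles.
article_ids = list(range(500))
train_ids, test_ids = train_test_split(article_ids)
```

Splitting at the article level keeps all candidates from one article on the same side of the split, which avoids leaking an article's vocabulary into the test set.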

Example candidate: "Brooklyn"
Features: term length, Wikipedia frequency, TF-IDF score, ...
Classifier: logistic regression
Output: P(keyword) = 0.81
Toolkit: nltk
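
To show how a logistic regression model turns a candidate's features into a probability such as P(keyword), here is a minimal sketch. The feature values, weights, and bias are made up for illustration and are not the fitted parameters from the talk, so the result is not expected to equal 0.81.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_keyword(features, weights, bias):
    """P(keyword) = sigmoid(bias + sum of weight * feature value)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical feature values for one candidate term and hypothetical weights.
features = {"term_length": 8, "wikipedia_freq": 0.2, "tfidf": 0.5}
weights = {"term_length": 0.05, "wikipedia_freq": -1.0, "tfidf": 2.0}
bias = -0.5

p = p_keyword(features, weights, bias)
```

A candidate would be kept as a keyword when `p` exceeds a chosen threshold; the threshold is a tunable parameter, not something stated on the slides.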

State-of-the-art performance on Crowd500

About me

glass

Keywords: Lavanya, image recognition, foodie

Keyword classifier

“Brooklyn”

Features:
  • Term frequency
  • TF-IDF score
  • Wikipedia frequency
  • Term length
  • Capitalized?
  • Position in page
  • Spread in page
  • Named entity?
  • Noun phrase?
  • Ngram?

Logistic regression model

In-sample: 65%, out-of-sample: 65%, chance: 50%
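
Several of the simpler features listed for the classifier can be computed with the standard library alone. The sketch below is an assumption-laden illustration: the slides do not define "position" or "spread", so here position is taken as the relative offset of the first occurrence and spread as the relative distance between the first and last occurrence. The `term_features` function and the `doc_freq` and `n_docs` inputs are hypothetical, and only single-word terms are handled.

```python
import math
import re

def term_features(term, text, doc_freq, n_docs):
    """Compute a few per-term features for a single-word term found in text."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    positions = [i for i, t in enumerate(lowered) if t == term.lower()]
    if not positions:
        raise ValueError(f"{term!r} does not occur in the text")
    tf = len(positions) / len(tokens)
    # Smoothed inverse document frequency; one common variant among several.
    idf = math.log(n_docs / (1 + doc_freq))
    return {
        "term_frequency": tf,
        "tfidf": tf * idf,
        "term_length": len(term),
        "capitalized": term[:1].isupper(),
        "position": positions[0] / len(tokens),
        "spread": (positions[-1] - positions[0]) / len(tokens),
    }

text = "Brooklyn is a borough of New York City. Many artists live in Brooklyn."
feats = term_features("Brooklyn", text, doc_freq=9, n_docs=100)
```

Features such as named-entity or noun-phrase membership would need a tagger or chunker and are left out of this sketch.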

Simple heuristics beat complex features

All stages of model required for performance

Beats AlchemyAPI for a range of parameters
