Extracting keywords from text
Lavanya Sharan, Insight Data Science Fellow
Matching ads to content
[Figure: keywords and the matching ad for a series of content examples; later examples are marked "?"]
A machine learning model for keyword extraction
Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
Example candidate keyword: Brooklyn
Features: term length, Wikipedia frequency, TF-IDF score, ...
Classifier: logistic regression; output for Brooklyn: P(keyword) = 0.81
Training data: Crowd500, 500 news articles with human-annotated keywords, 9:1 training-test split
Tools: nltk
State-of-the-art performance on Crowd500
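The first stages of the pipeline (content to candidate keywords, then one feature per candidate) might look roughly like the sketch below. It is an illustration only, not the author's implementation: the slides mention nltk, whereas this version uses just the Python standard library, and the function names `candidate_keywords` and `tf_idf`, the n-gram candidate scheme, the smoothed IDF formula, and the three-document toy corpus are all assumptions.

```python
import math
import re
from collections import Counter


def candidate_keywords(text, max_n=2):
    """Return lowercase word n-grams (n = 1..max_n) as candidate keywords."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            candidates.append(" ".join(words[i:i + n]))
    return candidates


def tf_idf(term, doc_candidates, corpus_candidates):
    """Relative frequency in the document times a smoothed inverse document frequency."""
    tf = Counter(doc_candidates)[term] / len(doc_candidates)
    docs_with_term = sum(1 for doc in corpus_candidates if term in doc)
    idf = math.log((1 + len(corpus_candidates)) / (1 + docs_with_term)) + 1
    return tf * idf


# Toy corpus (made up for illustration; not taken from Crowd500).
corpus = [
    "Brooklyn pizza shops are opening across Brooklyn",
    "The stock market closed higher today",
    "A new pizza recipe from Naples",
]
corpus_candidates = [candidate_keywords(doc) for doc in corpus]
doc = corpus_candidates[0]

brooklyn_score = tf_idf("brooklyn", doc, corpus_candidates)
pizza_score = tf_idf("pizza", doc, corpus_candidates)
```

In this toy corpus "brooklyn" scores higher than "pizza" for the first document, because it occurs twice there and in no other document.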
About me
glass
Keywords: Lavanya, image recognition, foodie
Keyword classifier
“Brooklyn”
Term frequency, TF-IDF score, Wikipedia frequency
Term length, capitalized?
Position in page, spread in page
Named entity? Noun phrase? N-gram?
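Several of the simpler features listed above can be computed directly from the page text. The sketch below is a guess at how that could be done for single-word terms; the slides do not define "position" or "spread", so the definitions here (first occurrence and distance between first and last occurrence, both as fractions of page length) and the helper name `heuristic_features` are assumptions.

```python
import re


def heuristic_features(term, text):
    """Compute simple features for a single-word candidate term."""
    words = re.findall(r"\w+", text)
    # Word indices at which the term occurs (case-insensitive match).
    positions = [i for i, w in enumerate(words) if w.lower() == term.lower()]
    n = len(words)
    return {
        "term_frequency": len(positions),
        "term_length": len(term),
        "capitalized": term[:1].isupper(),
        # Assumed definition: first occurrence as a fraction of page length.
        "first_position": positions[0] / n if positions else 1.0,
        # Assumed definition: distance between first and last occurrence.
        "spread": (positions[-1] - positions[0]) / n if positions else 0.0,
    }


page = "Brooklyn is a borough of New York City. Many artists live in Brooklyn."
features = heuristic_features("Brooklyn", page)
```

Features that need outside resources (Wikipedia frequency, named-entity and noun-phrase tags) are left out here because they depend on data and tools the slides do not specify.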
Logistic regression model
In-sample: 65%, out-of-sample: 65%, chance: 50%
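For reference, a logistic regression classifier turns a feature vector into P(keyword) by passing a weighted sum through the sigmoid function, and a score like the ones above can be obtained by thresholding that probability. The sketch below shows the mechanics only: the weights, bias, feature names, and labeled examples are hypothetical, and it assumes the reported percentages are classification accuracy, which the slides do not state.

```python
import math


def keyword_probability(features, weights, bias):
    """P(keyword) = sigmoid(bias + sum of weight * feature value)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(examples, weights, bias, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(
        (keyword_probability(f, weights, bias) >= threshold) == label
        for f, label in examples
    )
    return correct / len(examples)


# Hypothetical weights; real values would come from fitting on labeled data.
weights = {"tf_idf": 4.0, "term_length": 0.1, "capitalized": 1.0}
bias = -2.0

# Hypothetical labeled candidates: (features, is_keyword).
examples = [
    ({"tf_idf": 0.6, "term_length": 8, "capitalized": 1}, True),
    ({"tf_idf": 0.05, "term_length": 3, "capitalized": 0}, False),
    ({"tf_idf": 0.1, "term_length": 5, "capitalized": 0}, True),
]

p = keyword_probability(examples[0][0], weights, bias)
acc = accuracy(examples, weights, bias)
```

With these made-up numbers the third candidate is misclassified, so `acc` is 2/3. Running the same function on training and held-out data gives in-sample and out-of-sample figures.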
Simple heuristics beat complex features
All stages of model required for performance
Beats AlchemyAPI for a range of parameters