How to prepare data for NLP
Loryfel Nunez
@lorynyc
California Gold Rush
“Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”
Nikolas Markou (via LinkedIn)
1. Have an intuition on how things work
2. Breaking data down
3. Keep it simple .. if possible
1. How does it work, anyway?
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
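The counts on this slide can be read as how often each term appears in each of three documents. A minimal sketch of that idea, using only the standard library; the three toy documents and the `term_counts` helper are made up for illustration and are not from the talk:

```python
import re

def term_counts(documents, terms):
    """Return {term: [count in doc 0, count in doc 1, ...]}."""
    table = {}
    for term in terms:
        # Whole-word, case-insensitive match; works for multi-word terms too
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        table[term] = [len(pattern.findall(doc)) for doc in documents]
    return table

# Hypothetical documents, chosen so the counts mirror the slide
docs = [
    "The dog chased the other dog and a third dog.",
    "One dog barked at another dog.",
    "A dog in a red coat.",
]
counts = term_counts(docs, ["dog", "red coat"])
print(counts)
```

Each document becomes a column of numbers, which is the only thing a learning algorithm sees. A library vectorizer would normally replace this helper in practice.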
Controlling the input
Document Unit
Representation of text
Inside the Machine
Smith acquires shares of Novak and Kline for $10.99 per share .
2. BREAK IT DOWN
Let’s Break it Down
á
Novák
Novák and Kline
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith Inc. acquires shares of Novak and Kline for $10.99 per share.
Smith acquires common shares of N & K for $10.99/share.
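The levels above (document, sentence, token, character) can be sketched with a few regular expressions. This is a rough illustration, not the speaker's method: `break_down`, the sentence-boundary rule, and the token pattern are all assumptions, and a real project would more likely rely on spaCy or NLTK for this step.

```python
import re

# Token pattern (an assumption): prices such as $10.99, words with inner
# hyphens or apostrophes, or any single punctuation mark
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")

def break_down(document):
    """Split a document into sentences, then each sentence into tokens."""
    # Naive rule: a sentence ends at . ! or ? followed by whitespace and a
    # capital letter, so "Inc. acquires" does not trigger a split
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", document.strip())
    tokens = [TOKEN_RE.findall(sentence) for sentence in sentences]
    return sentences, tokens

sample = (
    "Smith acquires shares of Novák and Kline for $10.99 per share. "
    "Smith Inc. acquires shares of Novak and Kline."
)
sentences, tokens = break_down(sample)
print(sentences)
print(tokens[0])
print(list(tokens[0][4]))  # characters of a single token
```

The capital-letter rule is fragile by design: "Smith Inc. Acquires" would be split in the wrong place, which is the kind of case that motivates the later slides.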
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
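One way to pull usable pieces out of markup like this is to collect paragraph text and table cells separately. The sketch below uses the standard library's `html.parser`; the `ArticleExtractor` class and its attribute names are hypothetical, and the input is the slide's own snippet, including its unclosed `<p>` and bare `&`.

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect paragraph text and table rows from loosely formed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self.table_rows = []
        self._row = None
        self._buffer = []

    def _flush(self):
        # Join buffered text and collapse runs of whitespace
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        return text

    def _end_paragraph(self):
        text = self._flush()
        if text:  # skip empty paragraphs such as <p></p>
            self.paragraphs.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> also closes a paragraph that was never closed
            self._end_paragraph()
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._end_paragraph()
        elif tag in ("td", "th") and self._row is not None:
            self._row.append(self._flush())
        elif tag == "tr" and self._row is not None:
            self.table_rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        self._buffer.append(data)

RAW_HTML = """<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>"""

extractor = ArticleExtractor()
extractor.feed(RAW_HTML)
extractor.close()
print(extractor.paragraphs)
print(extractor.table_rows)
```

Keeping the table apart from the running text means each part can later be handled on its own terms.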
2. … if possible
Character
á
&
Do you know the encoding of your input data?
◉ User tells you
◉ Metadata
◉ Figure it out (using chardet, or similar)
◉ Have your own heuristics
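The options above can be combined into a simple fallback chain: trust a declared encoding first, then try a short list of likely candidates. This is a sketch under assumptions; `decode_bytes` and its default candidate list are hypothetical, and a detector such as chardet could supply the `declared` guess instead of a user or metadata.

```python
def decode_bytes(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, returning (text, encoding actually used)."""
    # 1. What the user or the metadata says, 2. our own ordered guesses
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails
    return raw.decode("latin-1"), "latin-1"

print(decode_bytes("Novák".encode("utf-8")))
print(decode_bytes("Novák".encode("cp1252")))
```

Trying UTF-8 before a single-byte encoding matters because single-byte decoders accept almost any input and would silently produce garbled characters.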
Tokens
Forty-two, 42
Post-colonial, postcolonial
eBay, Ebay, EBAY, ebay
Fed, FED, fed
C.A.T., CAT
Heuristics
Mappings
Transformations
numToWord, POS (from spaCy or NLTK)
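The three approaches on this slide (heuristics, mappings, transformations) can be sketched in a few lines. The tables and the `normalize_token` helper below are hypothetical, built only from the slide's own examples; which surface form counts as canonical is a project decision.

```python
import re

# Heuristic: dotted acronyms such as C.A.T.
ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.){2,}")

# Mapping where case carries meaning: "fed" may be a verb, so it is left alone
CASE_SENSITIVE = {"FED": "Fed"}

# Mapping applied regardless of case (hypothetical canonical forms)
CASE_INSENSITIVE = {
    "ebay": "eBay",
    "postcolonial": "post-colonial",
    "post-colonial": "post-colonial",
    "forty-two": "42",
}

def normalize_token(token):
    """Map spelling variants of a token to one canonical form."""
    if ACRONYM_RE.fullmatch(token):
        # Transformation: drop the periods and uppercase the letters
        return token.replace(".", "").upper()
    if token in CASE_SENSITIVE:
        return CASE_SENSITIVE[token]
    return CASE_INSENSITIVE.get(token.lower(), token)

for raw in ["Forty-two", "postcolonial", "EBAY", "Ebay", "FED", "fed", "C.A.T."]:
    print(raw, "=>", normalize_token(raw))
```

Part-of-speech tags from spaCy or NLTK could replace the case check when deciding whether "fed" is the verb or the institution.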
Tokens
STEMMING vs LEMMATIZATION
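import spacy
from nltk.stem.porter import PorterStemmer

# 'en_core_web_sm' replaces the 'en' shortcut, which newer spaCy releases no longer accept
nlp = spacy.load('en_core_web_sm')
stemmer = PorterStemmer()
doc = nlp('She is an intelligence operative.')
for word in doc:
    # The lemma comes from spaCy, the stem from NLTK's Porter stemmer
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)
# Note: spaCy 2.x reports -PRON- as the lemma of pronouns; later releases return the pronoun itself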
She LEMMA => -PRON- STEM => she
is LEMMA => be STEM => is
an LEMMA => an STEM => an
intelligence LEMMA => intelligence STEM => intellig
operative LEMMA => operative STEM => oper
. LEMMA => . STEM => .
SpaCy, NLTK
Entities
Novak and Kline, NK,
NYSE:NK, Test Company
June 30, 2017
06/30/2017
30/6/2017
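The three date spellings above can be brought to a single form by trying a list of known formats in order. A minimal sketch, assuming an English locale for month names; `normalize_date` and the format order are assumptions, and a string like 06/07/2017 stays ambiguous (here the month-first reading wins).

```python
from datetime import datetime

# Formats are tried in order; month-first is preferred over day-first
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

for raw in ["June 30, 2017", "06/30/2017", "30/6/2017"]:
    print(raw, "=>", normalize_date(raw))
```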
Smith acquires shares of Novak and Kline for
$10.99 per share .
Smith acquires shares of NK for $10.99 per
share .
ORG acquires shares of ORG for $10.99 per share
.
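The two rewrites above (alias to canonical name, then name to entity type) can be sketched as plain string replacement. The `ALIASES` and `ENTITY_TYPES` tables and the `replace_mentions` helper are hypothetical; in practice the mentions would more likely come from a named-entity recognizer than from a hand-written table.

```python
import re

# Hypothetical lookup tables built from the slide's examples
ALIASES = {"Novak and Kline": "NK", "NYSE:NK": "NK", "Test Company": "NK"}
ENTITY_TYPES = {"NK": "ORG", "Smith": "ORG"}

def replace_mentions(text, mapping):
    """Replace every whole-word occurrence of a key with its value."""
    # Longest keys first, so a short key never clobbers part of a longer one
    for mention in sorted(mapping, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
        text = re.sub(pattern, mapping[mention], text)
    return text

sentence = "Smith acquires shares of Novak and Kline for $10.99 per share ."
canonical = replace_mentions(sentence, ALIASES)
masked = replace_mentions(canonical, ENTITY_TYPES)
print(canonical)
print(masked)
```

Masking to entity types lets a model learn the pattern "ORG acquires shares of ORG" rather than memorizing specific company names.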
Hot or Not
                   REMOVING                           HIGHLIGHTING
WORDS              Emails, dates, URLs, stop words    hotwords
More than WORDS    tables                             Hot patterns
textacy
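The word-level row of this slide, removing noise and flagging hotwords, can be sketched with regular expressions. The patterns, the word lists, and `clean_and_flag` are assumptions made for illustration; the slide names textacy for this job, which the sketch does not use.

```python
import re

# Deliberately simple patterns (assumptions, not production-grade)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Hypothetical word lists
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "or", "on", "to"}
HOTWORDS = {"acquires", "acquired", "merger"}

def clean_and_flag(text):
    """Return (tokens kept after removal, hotwords found among them)."""
    for pattern in (URL_RE, EMAIL_RE, DATE_RE):
        text = pattern.sub(" ", text)
    tokens = re.findall(r"\w+(?:[-']\w+)*", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    hot = [t for t in kept if t in HOTWORDS]
    return kept, hot

text = (
    "On 06/30/2017 Smith acquires shares of Novak. "
    "Contact ir@smith.example or see https://smith.example/news for the details."
)
kept, hot = clean_and_flag(text)
print(kept)
print(hot)
```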
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
IRL
{'title': 'Smith Buys …',
 'original_text': 'LONDON --- Smith..',
 'transformed_text': {
     'text_with_entities': 'LOCATION – ORG acquired …. ',
     'lemmatized': 'Smith Inc acquire share..',
     'has_acquired': True
 },
 'table': '<table>….. </table>'
}
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
1. Have an intuition on how things work
-- how algorithms see text
2. Breaking data down
-- from bytes to documents
3. Keep it simple .. if possible
-- patterns, normalization, metadata, actions (replace, remove, highlight)
References and other stuff
◉ Stanford NLP Group
◉ spaCy documentation
◉ scikit-learn documentation
◉ The hard knocks of NLP projects