PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

How to prepare data for NLP

Loryfel Nunez

@lorynyc

California Gold Rush

“Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”

Nikolas Markou

(via LinkedIn)

1. Have an intuition on how things work

2. Breaking data down

3. Keep it simple .. if possible

1. How does it work, anyway?

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭
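One way to read the numbers above: each row is a term, and each column is how often that term appears in one of three documents. A minimal sketch of that counting follows. The three sample documents and the function name `term_counts` are assumptions made up for illustration; they are not from the talk.

```python
import re

def term_counts(documents, terms):
    """Count whole-word occurrences of each term in each document."""
    table = {}
    for term in terms:
        # \b keeps "dog" from matching inside a longer word such as "dogma".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        table[term] = [len(pattern.findall(doc)) for doc in documents]
    return table

# Hypothetical documents, chosen so the counts match the slide.
documents = [
    "Dog, dog, dog!",
    "The dog chased another dog.",
    "A dog in a red coat.",
]

counts = term_counts(documents, ["dog", "red coat"])
for term, row in counts.items():
    print(term + ":", ", ".join(str(n) for n in row))
```

A classifier never sees the sentences themselves, only rows of numbers like these, which is why controlling what goes into them matters.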

Controlling the input

Document Unit

Representation of text

Inside the Machine

Smith acquires shares of Novak and Kline for $10.99 per share .
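The space before the final period suggests the sentence is shown as a sequence of tokens. A small regex tokenizer that produces this kind of split might look like the sketch below. The pattern is an assumption for illustration only; real projects would more likely rely on the spaCy or NLTK tokenizers.

```python
import re

# Order matters: try a money/number token first, then a word, then any
# single punctuation character.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(text):
    """Split text into number, word, and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Smith acquires shares of Novak and Kline for $10.99 per share.")
print(" ".join(tokens))
```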




2. BREAK IT DOWN

Let’s Break it Down

á

Novák

Novák and Kline

Smith acquires shares of Novak and Kline for $10.99 per share.

Smith acquires shares of Novak and Kline for $10.99 per share.

Smith Inc. acquires shares of Novak and Kline for $10.99 per share.

Smith acquires common shares of N & K for $10.99/share.

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>
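Markup like this usually has to be split into running text and table data before any NLP step. The sketch below uses only the standard library `html.parser` module. The class name `ArticleParser` and its attributes are hypothetical, and a production pipeline might prefer a dedicated HTML library.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect paragraph text and table rows separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []   # text found outside tables
        self.rows = []         # one list of cell strings per table row
        self._in_table = False
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self._row = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_table:
            if self._row is not None:
                self._row.append(data)
        else:
            self.text_parts.append(data)

SAMPLE = (
    "<p><b>Smith Buys Novak</b></p><p></p><p>by Anna Smith<p>"
    "<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for "
    "$10.99/share.</p>"
    '<table style="width:100%">'
    "<tr><th>Col1</th><th>Col2</th> </tr>"
    "<tr><td>data1</td><td>data2</td></tr>"
    "</table>"
)

parser = ArticleParser()
parser.feed(SAMPLE)
parser.close()
print(parser.text_parts)
print(parser.rows)
```

The parser tolerates the unclosed `<p>` after the byline, which is the kind of irregularity real-world input tends to contain.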

… if possible

2

Character

á

&

Do you know the encoding of your input data?

◉ User tells you

◉ Metadata

◉ Figure it out (using chardet, or similar)

◉ Have your own heuristics
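The options above can be combined into a simple fallback chain: trust a declared encoding first, then try likely candidates. This is a sketch of the "own heuristics" option using only the standard library. The function name `decode_with_fallback` and the candidate order are assumptions; a detector such as chardet could be tried before the candidates.

```python
def decode_with_fallback(raw, declared=None, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    order = ([declared] if declared else []) + list(candidates)
    for encoding in order:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next one.
            continue
    # latin-1 maps every byte to a character, so it never fails,
    # but the result may be wrong.
    return raw.decode("latin-1"), "latin-1"

utf8_bytes = "Novák".encode("utf-8")
legacy_bytes = "Novák".encode("latin-1")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(legacy_bytes))
```

Trying UTF-8 first is a common choice because bytes from other encodings rarely happen to form valid UTF-8 sequences.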

Tokens

Forty-two, 42

Post-colonial, postcolonial

eBay, Ebay, EBAY, ebay

Fed, FED, fed

C.A.T., CAT

Heuristics

Mappings

Transformations

numToWord, POS (from spaCy or NLTK)
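A heuristic plus a mapping table is often enough to collapse the spelling variants listed above. In this sketch, the `CANONICAL` table and the function `normalize_token` are hypothetical, and the entries are only the examples from the slide.

```python
import re

# Hypothetical mapping from a lowercased, hyphen-free key to a canonical form.
CANONICAL = {
    "ebay": "eBay",
    "postcolonial": "postcolonial",
    "42": "forty-two",
    "fortytwo": "forty-two",
}

ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.){2,}")

def normalize_token(token):
    """Map spelling variants of a token to one canonical form."""
    # Heuristic: dotted acronyms such as C.A.T. lose their periods.
    if ACRONYM_RE.fullmatch(token):
        return token.replace(".", "").upper()
    key = token.lower().replace("-", "")
    # Tokens without a mapping are returned unchanged.
    return CANONICAL.get(key, token)

for variant in ["eBay", "Ebay", "EBAY", "Post-colonial", "42", "C.A.T.", "Fed", "fed"]:
    print(variant, "=>", normalize_token(variant))
```

"Fed" and "fed" are deliberately left alone: lowercasing them would merge the central bank with the verb, which is where part-of-speech tags from spaCy or NLTK can help decide.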

Tokens

STEMMING vs LEMMATIZATION

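import spacy
from nltk.stem.porter import PorterStemmer

# 'en' is the model shortcut used by older spaCy releases; newer releases
# load a named package such as 'en_core_web_sm' instead.
nlp = spacy.load('en')
stemmer = PorterStemmer()

doc = nlp(u'She is an intelligence operative.')

# Print the dictionary form (lemma) next to the chopped form (stem).
for word in doc:
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)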

She LEMMA => -PRON- STEM => she

is LEMMA => be STEM => is

an LEMMA => an STEM => an

intelligence LEMMA => intelligence STEM => intellig

operative LEMMA => operative STEM => oper

. LEMMA => . STEM => .

spaCy, NLTK

Entities

Novak and Kline, NK, NYSE:NK, Test Company

June 30, 2017

06/30/2017

30/6/2017
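The three spellings above name the same day, so a normalizer can map them all to one ISO date. This is a sketch with the standard library. The format list and its order are assumptions: an ambiguous input such as 06/07/2017 is read month-first here, and `%B` expects English month names under the default locale.

```python
from datetime import datetime

# Hypothetical format list; the first format that parses wins.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")

def normalize_date(text, formats=DATE_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

for raw in ["June 30, 2017", "06/30/2017", "30/6/2017"]:
    print(raw, "=>", normalize_date(raw))
```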

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of NK for $10.99 per share .

ORG acquires shares of ORG for $10.99 per share .
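The three sentences above show two steps: aliases are first mapped to one canonical name, then each name is replaced by its entity type. A dictionary-based sketch of both steps follows. The tables `CANONICAL_NAME` and `ENTITY_TYPE` are hypothetical hand-built lookups; a named-entity recognizer such as spaCy's could supply the types instead.

```python
import re

# Hypothetical lookup tables built by hand for this example.
CANONICAL_NAME = {"Novak and Kline": "NK", "NYSE:NK": "NK"}
ENTITY_TYPE = {"NK": "ORG", "Smith": "ORG"}

def replace_names(text, table):
    """Replace every key of `table` found in `text` with its value."""
    # Longest names first, so "Novak and Kline" wins over any shorter alias.
    names = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(name) for name in names) + r")(?!\w)"
    )
    return pattern.sub(lambda match: table[match.group(0)], text)

sentence = "Smith acquires shares of Novak and Kline for $10.99 per share ."
canonical = replace_names(sentence, CANONICAL_NAME)
masked = replace_names(canonical, ENTITY_TYPE)
print(canonical)
print(masked)
```

Masking names with their type lets a model learn the pattern "ORG acquires shares of ORG" once, instead of once per company.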

Hot or Not

                   REMOVING                           HIGHLIGHTING
WORDS              Emails, dates, URLs, stop words    hotwords
More than WORDS    tables                             Hot patterns

textacy
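Removing noise and flagging hotwords can both be done with small regular expressions and word sets. The sketch below uses plain Python rather than textacy. The patterns, the `STOP_WORDS` and `HOTWORDS` sets, and the function names are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Tiny hypothetical word lists; real lists would be much longer.
STOP_WORDS = {"the", "of", "for", "and", "a"}
HOTWORDS = {"acquires", "acquired"}

def remove_noise(text):
    """Drop URLs, email addresses, and stop words."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    kept = [token for token in text.split() if token.lower() not in STOP_WORDS]
    return " ".join(kept)

def has_hotword(text):
    """Return True if any token, ignoring edge punctuation, is a hotword."""
    return any(token.strip(".,;:!?").lower() in HOTWORDS for token in text.split())

cleaned = remove_noise("Email anna@example.com about the shares of Novak")
print(cleaned)
print(has_hotword("Smith Inc. acquires shares."))
```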

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>

IRL

{'title': 'Smith Buys …',
 'original_text': 'LONDON --- Smith..',
 'transformed_text': {
     'text_with_entities': 'LOCATION – ORG acquired …. ',
     'lemmatized': 'Smith Inc acquire share..',
     'has_acquired': True
 },
 'table': '<table>….. </table>'
}
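A record like this keeps the original text next to every transformed version, so no step destroys information. One way to assemble it is sketched below. The function `build_record` and the example transforms are hypothetical stand-ins for the real entity and lemmatization steps.

```python
def build_record(title, original_text, table_html, transforms, flags):
    """Bundle the original text with its transformed versions and flags."""
    transformed = {name: fn(original_text) for name, fn in transforms.items()}
    # Flags are stored as plain booleans alongside the transformed text.
    transformed.update({name: bool(fn(original_text)) for name, fn in flags.items()})
    return {
        "title": title,
        "original_text": original_text,
        "transformed_text": transformed,
        "table": table_html,
    }

record = build_record(
    title="Smith Buys Novak",
    original_text="LONDON --- Smith Inc. acquires shares for Novak & Kline.",
    table_html="<table></table>",
    # Stand-in transform; a real pipeline would plug in lemmatization here.
    transforms={"lowercased": str.lower},
    flags={"has_acquired": lambda text: "acquire" in text.lower()},
)
print(record["transformed_text"])
```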

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭

1. Have an intuition on how things work
   -- how algorithms see text

2. Breaking data down
   -- from bytes to documents

3. Keep it simple .. if possible
   -- patterns, normalization, metadata, actions (replace, remove, highlight)

References and other stuff

◉ Stanford NLP Group

◉ spaCy documentation

◉ scikit-learn documentation

◉ The hard knocks of NLP projects

Any questions?

You can find me at

◉ @lorynyc

◉ [email protected]

Thanks!
