How to prepare data for NLP Loryfel Nunez @lorynyc

PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx


Page 1: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

How to prepare data for NLP

Loryfel Nunez

@lorynyc

Page 2: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

California Gold Rush

Page 3: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.

Nikolas Markou

(via LinkedIn)

Page 4: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

1. Have an intuition on how things work

2. Breaking data down

3. Keep it simple .. if possible

Page 5: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

1. How does it work, anyway?

Page 6: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭
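The numbers on this slide read like per-document feature counts. A minimal sketch of how such counts could be produced is below; the three example documents and the `count_features` helper are invented for illustration so that they reproduce the slide's numbers, and are not from the talk.

```python
import re

def count_features(documents, features):
    """Count whole-word occurrences of each feature in each document."""
    counts = {}
    for feature in features:
        # Word boundaries keep "dog" from matching inside "dogs" or "hotdog".
        pattern = re.compile(r"\b" + re.escape(feature) + r"\b", re.IGNORECASE)
        counts[feature] = [len(pattern.findall(doc)) for doc in documents]
    return counts

# Hypothetical documents, chosen only to match the counts on the slide.
documents = [
    "The dog chased a dog while another dog slept.",
    "My dog likes your dog.",
    "A dog in a red coat.",
]
counts = count_features(documents, ["dog", "red coat"])
print(counts)
```

Each feature becomes one row of numbers, one column per document; that table of numbers, not the raw text, is what a learning algorithm actually receives.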

Page 7: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Controlling the input

Document Unit

Representation of text

Page 8: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Inside the Machine

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Page 9: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

2. BREAK IT DOWN

Page 10: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Let’s Break it Down

á

Novák

Novák and Kline

Smith acquires shares of Novak and Kline for $10.99 per share.

Smith acquires shares of Novak and Kline for $10.99 per share.

Smith Inc. acquires shares of Novak and Kline for $10.99 per share.

Smith acquires common shares of N & K for $10.99/share.
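The levels above (character, token, entity, sentence, document) can be made concrete with a small sketch. The `tokenize` helper and its regular expression are assumptions for illustration, not the tokenizer used in the talk; spaCy or NLTK would normally do this job.

```python
import re

# Illustrative token pattern: money amounts first, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into tokens, keeping '$10.99' in one piece."""
    return TOKEN_RE.findall(sentence)

word = "Novák"
sentence = "Smith acquires shares of Novak and Kline for $10.99 per share."

n_bytes = len(word.encode("utf-8"))  # 'á' takes two bytes in UTF-8
n_chars = len(word)                  # but it is a single character
tokens = tokenize(sentence)
print(n_bytes, n_chars, tokens)
```

Splitting on whitespace alone would leave `share.` as one token and would not separate `N`, `&`, and `K` consistently, which is why the pattern is spelled out.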

Page 11: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>

Page 12: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

3. … if possible

Page 13: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Character

á

&amp;

Do you know the encoding of your input data?

◉ User tells you

◉ Metadata

◉ Figure it out (using chardet, or similar)

◉ Have your own heuristics
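As a rough illustration of the "metadata, then heuristics" order, here is a standard-library sketch. It stands in for a detector such as chardet rather than reproducing it; the function names and the fallback list (`utf-8`, then `cp1252`) are assumptions, not the speaker's choices.

```python
import html
import unicodedata

def decode_bytes(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Try the declared encoding first, then a short list of fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown encoding name: try the next one
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"

def normalize_chars(text):
    """Turn HTML entities into characters and compose accented letters."""
    return unicodedata.normalize("NFC", html.unescape(text))

text, used = decode_bytes("Novák".encode("cp1252"))
print(text, used)
print(normalize_chars("Novak &amp; Kline"))
```

The order of the fallbacks is itself a heuristic: UTF-8 is tried first because invalid UTF-8 usually fails loudly, while single-byte encodings accept almost anything.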

Page 14: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Tokens

Forty-two, 42

Post-colonial, postcolonial

eBay, Ebay, EBAY, ebay

Fed, FED, fed

C.A.T., CAT

Heuristics

Mappings

Transformations

numToWord, POS (from spaCy or NLTK)
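A minimal sketch of the "mappings" and "heuristics" idea for the variants listed above. Both lookup tables and the `normalize_token` helper are hypothetical; a real project would build them from its own data.

```python
# Hypothetical tables for illustration only.
# Tokens whose capitalization carries meaning are handled before lowercasing.
CASE_SENSITIVE = {"Fed": "federal_reserve", "FED": "federal_reserve"}

# Variant spellings mapped to one canonical form (keys are lowercase).
MAPPINGS = {
    "forty-two": "42",
    "postcolonial": "post-colonial",
    "c.a.t.": "cat",
}

def normalize_token(token):
    """Map a token to its canonical form: exact-case rules first, then lowercase lookup."""
    if token in CASE_SENSITIVE:
        return CASE_SENSITIVE[token]
    lowered = token.lower()
    return MAPPINGS.get(lowered, lowered)

variants = ["eBay", "Ebay", "EBAY", "ebay"]
print({normalize_token(v) for v in variants})
```

Whether `Fed` and `fed` should collapse to one token depends on the task, which is why the case-sensitive table is checked before anything is lowercased.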

Page 15: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Tokens

STEMMING vs LEMMATIZATION

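import spacy
from nltk.stem.porter import PorterStemmer

# The talk used the old 'en' shortcut; newer spaCy releases need a full
# model name. The output below comes from the older version, which
# lemmatized pronouns to -PRON-.
nlp = spacy.load('en_core_web_sm')
stemmer = PorterStemmer()

doc = nlp(u'She is an intelligence operative.')

# Compare the dictionary form (lemma) with the chopped form (stem).
for word in doc:
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)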

She LEMMA => -PRON- STEM => she

is LEMMA => be STEM => is

an LEMMA => an STEM => an

intelligence LEMMA => intelligence STEM => intellig

operative LEMMA => operative STEM => oper

. LEMMA => . STEM => .

SpaCy, NLTK

Page 16: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Entities

Novak and Kline, NK, NYSE:NK, Test Company

June 30, 2017

06/30/2017

30/6/2017

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of NK for $10.99 per share .

ORG acquires shares of ORG for $10.99 per share .
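The sketch below shows one way the replacements on this slide could be done with a plain alias list and a few date formats. The alias list, the format order, and both helper names are assumptions for illustration; a trained entity recognizer (for example in spaCy) would normally replace the alias list.

```python
import re
from datetime import datetime

# Hypothetical alias list for illustration only.
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Smith"]

def mask_orgs(text, aliases=ORG_ALIASES, label="ORG"):
    """Replace every known alias with a generic entity label."""
    # Longest aliases first, so "Novak and Kline" wins over shorter overlaps.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)"
    )
    return pattern.sub(label, text)

# Tried in order; an ambiguous date such as 05/06/2017 is read as month/day.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

masked = mask_orgs("Smith acquires shares of Novak and Kline for $10.99 per share .")
print(masked)
print([normalize_date(d) for d in ("June 30, 2017", "06/30/2017", "30/6/2017")])
```

After masking, the two input sentences from the slide become identical, which is the point: the model sees one pattern instead of many surface variants.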

Page 17: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Hot or Not

                   REMOVING                           HIGHLIGHTING

WORDS              Emails, dates, URLs, stop words    hotwords

More than WORDS    tables                             Hot patterns

textacy
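A minimal standard-library sketch of the two actions on this slide, removing noise and highlighting "hot" patterns. The regular expressions, the tiny stop-word list, and the `has_acquired` pattern are illustrative assumptions; the slide points to textacy for this kind of work, and its API is not reproduced here.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "for", "and", "a", "an"}

# Hypothetical hot patterns: flag name -> pattern that switches the flag on.
HOT_PATTERNS = {
    "has_acquired": re.compile(r"\bacquir(?:e|es|ed|ing)\b", re.IGNORECASE),
}

def remove_noise(text):
    """Drop URLs, email addresses, and stop words."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    kept = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(kept)

def hot_flags(text):
    """Return one boolean per hot pattern instead of altering the text."""
    return {name: bool(p.search(text)) for name, p in HOT_PATTERNS.items()}

cleaned = remove_noise("Contact anna@example.com or see http://example.com/news for the details")
flags = hot_flags("Smith Inc. acquires shares for Novak & Kline.")
print(cleaned, flags)
```

Removing shrinks the text; highlighting leaves it alone and records a signal next to it, so the two can be combined freely.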

Page 18: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>
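To show how markup like this might be separated into title, body text, and table, here is a sketch using Python's built-in `html.parser`. Treating the `<b>` text as the title is a guess made for this example, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect bold text as title, table cells as rows, everything else as body."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & for us
        self.in_table = False
        self.in_bold = False
        self.title, self.text, self.rows = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = True
        elif tag == "tr" and self.in_table:
            self.rows.append([])
        elif tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse line breaks and extra spaces
        if not data:
            return
        if self.in_table:
            if self.rows:
                self.rows[-1].append(data)
        elif self.in_bold:
            self.title.append(data)
        else:
            self.text.append(data)

def parse_article(html_text):
    parser = ArticleParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "original_text": " ".join(parser.text),
        "table": parser.rows,
    }

SAMPLE_HTML = """<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>"""

article = parse_article(SAMPLE_HTML)
print(article)
```

The parser tolerates the unclosed `<p>` in the byline because it only reacts to the tags it cares about, which is a reasonable property to want for real-world markup.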

Page 19: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

IRL

{'title': 'Smith Buys …',
 'original_text': 'LONDON --- Smith..',
 'transformed_text': {
     'text_with_entities': 'LOCATION – ORG acquired …. ',
     'lemmatized': 'Smith Inc acquire share..',
     'has_acquired': True
 },
 'table': '<table>….. </table>'
}
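A record of this shape can be assembled by keeping the original text untouched and storing each transformation under its own key. The sketch below assumes a hypothetical `build_record` helper and two toy transformations; the real pipeline would plug in entity masking and lemmatization.

```python
import re

def build_record(title, text, transforms):
    """Keep the original text and store every transformation side by side."""
    return {
        "title": title,
        "original_text": text,
        "transformed_text": {name: fn(text) for name, fn in transforms.items()},
    }

# Hypothetical transformations for illustration only.
transforms = {
    "lowercased": str.lower,
    "has_acquired": lambda t: bool(re.search(r"\bacquir(?:e|es|ed)\b", t, re.IGNORECASE)),
}

record = build_record(
    "Smith Buys Novak",
    "LONDON --- Smith Inc. acquires shares.",
    transforms,
)
print(record)
```

Keeping `original_text` alongside the derived fields means a transformation can be changed or rerun later without going back to the source.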

Page 20: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭

Page 21: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

1. Have an intuition on how things work -- how algorithms see text

2. Breaking data down -- from bytes to documents

3. Keep it simple .. if possible -- patterns, normalization, metadata, actions (replace, remove, highlight)

Page 22: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

References and other stuff

◉ Stanford NLP Group

◉ spaCy Documentation

◉ scikit-learn Documentation

◉ The hard knocks of NLP projects

Page 23: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Any questions ?

You can find me at

◉ @lorynyc


Thanks!