Introduction to Machine Learning and Text Mining
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Naïve Approach: When all you have is a hammer…
[Diagram: Data → Representation → Target]
Slightly less naïve approach: Aimless wandering…
[Diagram: Data → Representation → Target]
Expert Approach: Hypothesis driven
[Diagram: Data → Representation → Target]
Suggested Readings
Witten, I. H., Frank, E., & Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Elsevier: San Francisco.
What is machine learning?
Automatically or semi-automatically:
Inducing concepts (i.e., rules) from data
Finding patterns in data
Explaining data
Making predictions
[Diagram: Data → Learning Algorithm → Model; New Data → Classification Engine → Prediction]
If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes
Perfect on training data
If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes
Performance on training data? Not perfect on testing data
If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes
IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won't get an accurate estimate of how well it will do on new data.
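The decision rule on the preceding slides can be written directly as code. This is a sketch: the attribute names and values follow the classic weather ("play or not") data used in the slides.

```python
def classify(outlook, windy):
    """Decision rule from the slides: predict whether to play."""
    if outlook == "sunny":
        return "no"
    elif outlook == "overcast":
        return "yes"
    elif outlook == "rainy" and windy:
        return "no"
    else:
        return "yes"
```

Checking this function only against the examples it was built from would repeat the mistake the slide warns about: held-out data is needed to estimate how it behaves on new instances.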
Simple Cross Validation
Let's say your data has attributes A, B, and C. You want to train a rule to predict D. First train on 2, 3, 4, 5, 6, 7 and apply the trained model to 1. The result is Accuracy1.
[Diagram: seven data segments; segment 1 = TEST, all other segments = TRAIN. Fold: 1]
Simple Cross Validation
Next train on 1, 3, 4, 5, 6, 7 and apply the trained model to 2. The result is Accuracy2.
[Diagram: segment 2 = TEST, all other segments = TRAIN. Fold: 2]
Simple Cross Validation
Next train on 1, 2, 4, 5, 6, 7 and apply the trained model to 3. The result is Accuracy3.
[Diagram: segment 3 = TEST, all other segments = TRAIN. Fold: 3]
Simple Cross Validation
Next train on 1, 2, 3, 5, 6, 7 and apply the trained model to 4. The result is Accuracy4.
[Diagram: segment 4 = TEST, all other segments = TRAIN. Fold: 4]
Simple Cross Validation
Next train on 1, 2, 3, 4, 6, 7 and apply the trained model to 5. The result is Accuracy5.
[Diagram: segment 5 = TEST, all other segments = TRAIN. Fold: 5]
Simple Cross Validation
Next train on 1, 2, 3, 4, 5, 7 and apply the trained model to 6. The result is Accuracy6.
[Diagram: segment 6 = TEST, all other segments = TRAIN. Fold: 6]
Simple Cross Validation
Finally, train on 1, 2, 3, 4, 5, 6 and apply the trained model to 7. The result is Accuracy7. Then average Accuracy1 through Accuracy7.
[Diagram: segment 7 = TEST, all other segments = TRAIN. Fold: 7]
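The seven-fold procedure above can be sketched in a few lines of Python. The toy dataset and the trivial majority-class "learner" here are illustrative assumptions; any learning algorithm could take the majority learner's place.

```python
def majority_label(rows):
    # "Train": just remember the most common value of D in the training rows.
    labels = [r["D"] for r in rows]
    return max(set(labels), key=labels.count)

def cross_validate(data, k=7):
    # Split the data into k segments, then rotate which segment is TEST.
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = majority_label(train)
        correct = sum(1 for r in test if r["D"] == model)
        accuracies.append(correct / len(test))
    # Finally: average Accuracy1 through Accuracyk.
    return sum(accuracies) / len(accuracies), accuracies

# Toy data: attributes A, B, C and the label D to predict.
data = [{"A": i, "B": i % 2, "C": i % 3, "D": "yes" if i % 2 else "no"}
        for i in range(14)]
avg, accs = cross_validate(data, k=7)
```

Each row is used for testing exactly once, so the averaged accuracy is an estimate of performance on data the model was not trained on.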
Working with Text
Basic Idea
Represent text as a vector where each position corresponds to a term
This is called the “bag of words” approach
                     Cheese  Cows  Eat  Hamsters  Make  Seeds
Cows make cheese.      1      1     0      0       1      0
Hamsters eat seeds.    0      0     1      1       0      1
But the same representation results for "Cheese makes cows."!
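A minimal bag-of-words sketch: map each sentence to a vector of term counts over an alphabetically sorted vocabulary, reproducing the table above. The tokenizer here (lowercase, strip the final period, split on spaces) is a deliberately naive assumption.

```python
def bag_of_words(sentences):
    # Naive tokenization: lowercase, drop the trailing period, split on spaces.
    tokenized = [s.lower().rstrip(".").split() for s in sentences]
    # Vocabulary: every distinct term, in alphabetical order.
    vocab = sorted({w for tokens in tokenized for w in tokens})
    # One count per vocabulary position, per sentence.
    vectors = [[tokens.count(w) for w in vocab] for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["Cows make cheese.", "Hamsters eat seeds."])
```

Because only counts are kept, any reordering of a sentence's tokens produces the same vector, which is exactly the limitation the slide points out.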
Part of Speech Tagging
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Part of Speech Tagging
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
Basic Types of Features
Unigram: single words (e.g., prefer, sandwich, take)
Bigram: pairs of words next to each other (e.g., machine_learning, eat_wheat)
POS-Bigram: pairs of POS tags next to each other (e.g., DT_NN, NNP_NNP)
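Extracting these three feature types from a tokenized, POS-tagged sentence can be sketched as below. The example tokens and their tags are hand-assigned assumptions for illustration; in practice a trained tagger using the Penn Treebank tag set would supply the tags.

```python
def unigrams(tokens):
    # Unigram features: the single words themselves.
    return list(tokens)

def bigrams(tokens):
    # Bigram features: adjacent pairs, joined with an underscore.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = ["machine", "learning", "is", "useful"]
tags = ["NN", "NN", "VBZ", "JJ"]  # assumed POS tags for these tokens

unigram_feats = unigrams(tokens)
bigram_feats = bigrams(tokens)
pos_bigram_feats = bigrams(tags)  # same pairing logic, applied to the tags
```

The same pairing function works for word bigrams and POS-bigrams; only the input sequence changes.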
Keep this picture in mind…
Machine learning isn't magic, but it can be useful for identifying meaningful patterns in your data when used properly. Proper use requires insight into your data.
?