
Page 1

Natural Language Processing
Machine Learning

Potsdam, 26 April 2012

Saeedeh Momtazi
Information Systems Group

Page 2

Introduction

Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed.

[Arthur Samuel, 1959]

Learning Methods:
Supervised learning
Active learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning

Page 3

Outline

1 Supervised Learning

2 Semi-Supervised Learning

3 Unsupervised Learning

Page 4

Outline

1 Supervised Learning

2 Semi-Supervised Learning

3 Unsupervised Learning

Page 5

Supervised Learning

Renting budget: 1000 €
Size: 180 m²
Age: 2 years   ✗

Page 6

Supervised Learning

[Figure: apartments plotted by size and age; the training items are labeled ✓ or ✗, and one new item is marked "?"]

Page 7

Classification

Training:
T_1 → C_1, T_2 → C_2, ..., T_n → C_n   →   F_1, F_2, ..., F_n   →   Model(F, C)

Testing:
T_{n+1} → ?   →   F_{n+1}   →   Model(F, C)   →   C_{n+1}
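As a minimal illustration of this training/testing pipeline, the sketch below (not from the slides; the toy texts, labels, and the choice of a scikit-learn classifier are assumptions for the example) maps items to bag-of-words features, fits a model on (F, C), and predicts the category of a new item.

# Sketch of the Training -> Model(F, C) -> Testing pipeline on invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# training items T_1..T_n with categories C_1..C_n
texts = ["cheap pills buy now", "meeting agenda attached",
         "win money fast", "project report draft"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()      # T_i -> F_i (bag-of-words features)
F = vectorizer.fit_transform(texts)

model = MultinomialNB()             # Model(F, C)
model.fit(F, labels)

# testing: T_{n+1} -> F_{n+1} -> C_{n+1}
F_new = vectorizer.transform(["free money offer"])
print(model.predict(F_new))         # e.g. ['spam']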

Page 8

Applications

Problem                    | Item     | Category
POS Tagging                | Word     | POS
Named Entity Recognition   | Word     | Named entity
Word Sense Disambiguation  | Word     | The word's sense
Spam Mail Detection        | Document | Spam / Not spam
Language Identification    | Document | Language
Text Categorization        | Document | Topic
Information Retrieval      | Document | Relevant / Not relevant

Page 9

Part Of Speech Tagging

“I saw the man on the roof.”

“I[PRON] saw[V] the[DET] man[N] on[PREP] the[DET] roof[N].”

[PRON] Pronoun
[PREP] Preposition
[DET] Determiner
[V] Verb
[N] Noun
...

Page 10

Named Entity Recognition

“Steven Paul Jobs, co-founder of Apple Inc, was born in California.”

“Steven Paul Jobs[Person], co-founder of Apple Inc[Organization], was born in California[Location].”

Entity types: Person, Organization, Location, Date, ...

Page 11

Word Sense Disambiguation

“Jim flew his plane to Texas.” (plane = aircraft)

“Alice destroys the item with a plane.” (plane = the woodworking tool)

Page 12

Spam Mail Detection

Page 13

Language Identification

Page 14

Text Categorization

[Figure: a news article that has to be assigned to one of the categories Culture, Sport, Politics, or Business]

Page 15

Information Retrieval

Page 16

Classification

Training:
T_1 → C_1, T_2 → C_2, ..., T_n → C_n   →   F_1, F_2, ..., F_n   →   Model(F, C)

Testing:
T_{n+1} → ?   →   F_{n+1}   →   Model(F, C)   →   C_{n+1}

Page 17

Classification Algorithms

K Nearest Neighbor
Support Vector Machines
Naïve Bayes
Maximum Entropy
Linear Regression
Logistic Regression
Neural Networks
Decision Trees
Boosting
...

Page 18

K Nearest Neighbor

[Figure: labeled items of two classes in feature space, plus one unlabeled item marked "?"]

Page 19

K Nearest Neighbor

[Figure: the same plot; the neighborhood of the unlabeled item "?" is examined]

Page 20

K Nearest Neighbor

1 Nearest Neighbor

Page 21

K Nearest Neighbor


3 Nearest Neighbor

Page 22

K Nearest Neighbor

3 Nearest Neighbor

Page 23

K Nearest Neighbor

3 Nearest Neighbor (but this item is closer)
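To make the voting idea concrete, here is a small sketch (not from the slides; the 2-D points, labels, and the use of scikit-learn's KNeighborsClassifier are assumptions) that classifies one new item with k = 1 and k = 3 neighbors.

# k-nearest-neighbor sketch on invented 2-D data (e.g. size, age).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[180, 2], [120, 30], [90, 45], [160, 5], [70, 50], [150, 10]]
y_train = ["good", "bad", "bad", "good", "bad", "good"]

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # the label is decided by a majority vote of the k nearest training items
    print(k, knn.predict([[140, 12]]))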

Page 24

Support Vector Machines

Page 25

Support Vector Machines

Find a hyperplane in the vector space that separates the items of the two categories

Page 26

Support Vector Machines

There might be more than one possible separating hyperplane

Page 27

Support Vector Machines


There might be more than one possible separating hyperplane

Page 28

Support Vector Machines

Find the hyperplane with maximum margin
Vectors at the margins are called support vectors

Page 29

Support Vector Machines

Find the hyperplane with maximum margin
Vectors at the margins are called support vectors
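As an illustration of the maximum-margin idea, the sketch below (invented data; scikit-learn's SVC with a linear kernel is an assumption, not part of the slides) fits a separating hyperplane and exposes the support vectors lying on the margins.

# Linear SVM sketch: maximum-margin hyperplane on toy 2-D data.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear")
svm.fit(X, y)

print(svm.support_vectors_)   # the vectors at the margins (support vectors)
print(svm.predict([[3, 3]]))  # classify a new item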

Page 30

Naïve Bayes

Selecting the class with the highest probability
⇒ Minimizing the number of items with wrong labels

c = \arg\max_{c_i} P(c_i)

The probability should depend on the data d that is to be classified:

P(c_i \mid d)

Page 31

Naïve Bayes

c = \arg\max_{c_i} P(c_i)

c = \arg\max_{c_i} P(c_i \mid d)

c = \arg\max_{c_i} \frac{P(d \mid c_i) \cdot P(c_i)}{P(d)}        (P(d) has no effect on the argmax)

c = \arg\max_{c_i} P(d \mid c_i) \cdot P(c_i)

Page 32

Naïve Bayes

c = \arg\max_{c_i} P(d \mid c_i) \cdot P(c_i)

P(d \mid c_i): likelihood probability
P(c_i): prior probability
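A minimal Naïve Bayes sketch of c = argmax_{c_i} P(d|c_i)·P(c_i) (the priors and word likelihoods below are invented and assumed already smoothed; the usual naïve assumption treats the words of d as conditionally independent given the class):

# Naive Bayes by hand: c = argmax_c P(c) * prod_w P(w|c), with toy probabilities.
import math

prior = {"spam": 0.4, "ham": 0.6}                      # P(c_i)
likelihood = {                                         # P(w | c_i)
    "spam": {"money": 0.05, "meeting": 0.001, "free": 0.04},
    "ham":  {"money": 0.005, "meeting": 0.03, "free": 0.002},
}

def classify(words):
    scores = {}
    for c in prior:
        # log space avoids underflow; unseen words are simply skipped here
        scores[c] = math.log(prior[c]) + sum(
            math.log(likelihood[c][w]) for w in words if w in likelihood[c])
    return max(scores, key=scores.get)

print(classify(["free", "money"]))   # -> 'spam'
print(classify(["meeting"]))         # -> 'ham'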

Page 33

Maximum Entropy

Assigning a weight λ_j to each feature f_j
Positive weight: the feature is likely to be effective
Negative weight: the feature is likely to be ineffective

Picking out a subset of the data by each feature

Voting for each class based on the sum of weighted features

c = \arg\max_{c_i} P(c_i \mid d, \lambda)

Page 34

Maximum Entropy

c = \arg\max_{c_i} P(c_i \mid d, \lambda)

P(c_i \mid d, \lambda) = \frac{\exp \sum_j \lambda_j \cdot f_j(c_i, d)}{\sum_{c'} \exp \sum_j \lambda_j \cdot f_j(c', d)}

The expectation of each feature is calculated as follows:

E(f_i) = \sum_{(c,d) \in (C,D)} P(c,d) \cdot f_i(c,d)
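In practice a maximum entropy classifier with these exponential weights is trained as (multinomial) logistic regression; the sketch below (toy binary feature vectors and labels, with scikit-learn as an assumed tool) estimates the weights λ_j and evaluates the softmax formula above.

# Maximum entropy sketch via logistic regression on toy feature vectors.
from sklearn.linear_model import LogisticRegression

X = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]]   # binary features f_j of each item
y = ["A", "B", "A", "B"]                           # classes c_i

maxent = LogisticRegression()
maxent.fit(X, y)                                   # learns a weight lambda_j per feature

print(maxent.coef_)                                # the learned weights
print(maxent.predict_proba([[1, 0, 0]]))           # P(c_i | d, lambda) via the softmax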

Page 35

Classification

Training:
T_1 → C_1, T_2 → C_2, ..., T_n → C_n   →   F_1, F_2, ..., F_n   →   Model(F, C)

Testing:
T_{n+1} → ?   →   F_{n+1}   →   Model(F, C)   →   C_{n+1}

Page 36

Feature Selection

[Figure: the size/age scatter plot again, with items labeled ✓ or ✗]

Further possible features: Location, Number of rooms, Balcony, ...

Page 37

Feature Selection

Bag-of-words:
Each document can be represented by the set of words that appear in the document
The result is a high-dimensional feature space
The process is computationally expensive

Solution:
Using a feature selection method to select informative words
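For reference, a quick sketch of the bag-of-words representation whose dimensionality motivates feature selection (the toy documents are invented; scikit-learn's CountVectorizer is one possible implementation): every distinct word becomes one dimension.

# Bag-of-words sketch: each document becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # one dimension per distinct word
print(X.toarray())                          # high-dimensional, mostly sparse vectors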

Page 38

Feature Selection Methods

Information Gain
Mutual Information
χ-Square

Page 39

Information Gain

Measuring the number of bits required for category prediction w.r.t. the presence or absence of a term in the document
Removing words whose information gain is less than a predefined threshold

IG(w) = -\sum_{i=1}^{K} P(c_i) \log P(c_i)
      + P(w) \sum_{i=1}^{K} P(c_i \mid w) \log P(c_i \mid w)
      + P(\bar{w}) \sum_{i=1}^{K} P(c_i \mid \bar{w}) \log P(c_i \mid \bar{w})

Page 40

Information Gain

P(c_i) = N_i / N

P(w) = N_w / N        P(c_i \mid w) = N_{iw} / N_w
P(\bar{w}) = N_{\bar{w}} / N        P(c_i \mid \bar{w}) = N_{i\bar{w}} / N_{\bar{w}}

N: # docs
N_i: # docs in category c_i
N_w: # docs containing w
N_{\bar{w}}: # docs not containing w
N_{iw}: # docs in category c_i containing w
N_{i\bar{w}}: # docs in category c_i not containing w
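A small sketch of how such counts turn into an information-gain score (all counts are invented; the expression below is the same as IG(w) above, written as the category entropy minus the conditional entropies):

# Information gain of a word w from document counts (toy numbers, base-2 logs).
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 100                      # all documents
N_i = [40, 60]               # documents per category
N_w = 30                     # documents containing w
N_iw = [25, 5]               # documents per category that contain w

P_c    = [n / N for n in N_i]
P_w    = N_w / N
P_c_w  = [n / N_w for n in N_iw]
P_c_nw = [(N_i[i] - N_iw[i]) / (N - N_w) for i in range(len(N_i))]

IG = entropy(P_c) - P_w * entropy(P_c_w) - (1 - P_w) * entropy(P_c_nw)
print(IG)   # keep w as a feature only if IG exceeds the chosen threshold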

Page 41

Mutual Information

Measuring the effect of each word in predicting the category: how much does its presence or absence in a document contribute to category prediction?

MI(w, c_i) = \log \frac{P(w, c_i)}{P(w) \cdot P(c_i)}

Removing words whose mutual information is less than a predefined threshold

MI(w) = \max_i MI(w, c_i)

MI(w) = \sum_i P(c_i) \cdot MI(w, c_i)
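A short sketch of the two ways of combining the per-category scores (the probabilities are invented for illustration):

# Mutual information of a word with each category, combined by max or weighted average.
import math

P_c  = {"sport": 0.5, "politics": 0.5}        # P(c_i)
P_w  = 0.2                                    # P(w)
P_wc = {"sport": 0.18, "politics": 0.02}      # P(w, c_i)

MI = {c: math.log(P_wc[c] / (P_w * P_c[c])) for c in P_c}

MI_max = max(MI.values())                     # MI(w) = max_i MI(w, c_i)
MI_avg = sum(P_c[c] * MI[c] for c in P_c)     # MI(w) = sum_i P(c_i) * MI(w, c_i)
print(MI_max, MI_avg)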

Page 42

χ-square

Measuring the dependencies between words and categories

\chi^2(w, c_i) = \frac{N \cdot (N_{iw} N_{\bar{i}\bar{w}} - N_{i\bar{w}} N_{\bar{i}w})^2}{(N_{iw} + N_{i\bar{w}}) \cdot (N_{\bar{i}w} + N_{\bar{i}\bar{w}}) \cdot (N_{iw} + N_{\bar{i}w}) \cdot (N_{i\bar{w}} + N_{\bar{i}\bar{w}})}

Ranking words based on their χ-square measure

\chi^2(w) = \sum_{i=1}^{K} P(c_i) \cdot \chi^2(w, c_i)

Selecting the top words as features
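In practice the per-term χ² scores can be computed directly from a term-document matrix; the sketch below (toy corpus; scikit-learn's chi2 and SelectKBest are an assumed implementation) keeps the highest-scoring words as features.

# Chi-square feature selection sketch on a toy labeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["goal match team", "election vote party", "team wins match", "party wins vote"]
labels = ["sport", "politics", "sport", "politics"]

vec = CountVectorizer().fit(docs)
counts = vec.transform(docs)

selector = SelectKBest(chi2, k=3).fit(counts, labels)
keep = selector.get_support()
print([w for w, kept in zip(vec.get_feature_names_out(), keep) if kept])  # top-scoring words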

Page 43

Feature Selection

These models perform well for document-level classification:
Spam Mail Detection
Language Identification
Text Categorization

Word-level classification might need other types of features:
POS Tagging
Named Entity Recognition

(will be discussed later)

Page 44

Shortcoming

Data annotation is labor-intensive

Solution:
Using a minimum amount of annotated data
Having a human annotate further data only if it is very informative

Active Learning

Page 45

Active Learning

Page 46

Active Learning

Annotating a small amount of data

Page 47

Active Learning

[Figure: unlabeled items annotated with the classifier's confidence: H (high), M (medium), L (low)]

Annotating a small amount of data

Calculating the confidence score of the classifier on unlabeled data

Page 48

Active Learning

Annotating a small amount of data

Calculating the confidence score of the classifier on unlabeled data

Finding the informative unlabeled data (the data with the lowest confidence)

Having a human annotate the informative data
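One round of this loop can be sketched as pool-based uncertainty sampling (the classifier, toy data, and confidence measure below are placeholders, not prescribed by the slides):

# One round of active learning with uncertainty sampling (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

X_labeled = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_labeled = np.array([0, 1, 0, 1])
X_pool    = np.array([[0.5, 0.5], [0.05, 0.1], [0.95, 0.9], [0.45, 0.55]])

clf = LogisticRegression().fit(X_labeled, y_labeled)

confidence = clf.predict_proba(X_pool).max(axis=1)  # confidence on the unlabeled pool
query = confidence.argmin()                         # most informative = least confident
print("ask the human to label item", query, X_pool[query])
# after the human answers, add the item to the labeled set and retrain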

Page 49

Active Learning

Amazon Mechanical Turk

Page 50

Outline

1 Supervised Learning

2 Semi-Supervised Learning

3 Unsupervised Learning

Page 51

Semi-Supervised Learning

Problem of data annotation

Solution:
Using a minimum amount of annotated data
Annotating further data automatically, if it is easy to predict

Page 52

Semi-Supervised Learning

A small amount of labeled data

Page 53

Semi-Supervised Learning

A small amount of labeled data

A large amount of unlabeled data

Page 54

Semi-Supervised Learning

A small amount of labeled data

A large amount of unlabeled data

Solution:
Finding the similarity between the labeled and unlabeled data
Predicting the labels of the unlabeled data

Page 55

Semi-Supervised Learning

Training the classifier using:
The labeled data
The predicted labels of the unlabeled data

Page 56

Shortcoming

Introducing a lot of noisy data to the system

Page 57

Shortcoming

Introducing a lot of noisy data to the system

Solution:
Adding unlabeled data to the training set only if the predicted label has a high confidence
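A compact self-training sketch following this recipe (the data, classifier, and 0.9 threshold are illustrative assumptions): train on the labeled seed, predict the unlabeled pool, and adopt only high-confidence predictions before retraining.

# Self-training sketch: add confidently predicted unlabeled items to the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_lab   = np.array([[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]])
y_lab   = np.array([0, 1, 0, 1])
X_unlab = np.array([[0.05, 0.05], [0.95, 0.85], [0.5, 0.5]])

clf = LogisticRegression().fit(X_lab, y_lab)
proba = clf.predict_proba(X_unlab)
conf, pred = proba.max(axis=1), proba.argmax(axis=1)

keep = conf > 0.9                    # only high-confidence labels, to limit the noise
X_new = np.vstack([X_lab, X_unlab[keep]])
y_new = np.concatenate([y_lab, pred[keep]])

clf = LogisticRegression().fit(X_new, y_new)   # retrain on the enlarged training set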

Page 58

Semi-Supervised Learning

Page 59

Related Books

Semi-Supervised Learning
by O. Chapelle, B. Schölkopf, A. Zien
MIT Press, 2006

Semisupervised Learning for Computational Linguistics
by S. Abney
Chapman & Hall, 2007

Page 60

Outline

1 Supervised Learning

2 Semi-Supervised Learning

3 Unsupervised Learning

Page 61

Unsupervised Learning

[Figure: items plotted by area and age, without any labels]

Page 62

Unsupervised Learning

[Figure: the same area/age plot with the items grouped into clusters]

Page 63

Clustering

Working based on the similarities between the data items

Assigning the similar data items to the same cluster

Page 64

Applications

Word Clustering:
Speech Recognition
Machine Translation
Named Entity Recognition
Information Retrieval
...

Document Clustering:
Information Retrieval
...

Page 65

Speech recognition

[Spoken input] ⇒ “Computers can recognize speech.”
               “Computers can wreck a nice peach.”

Page 66

Machine Translation

“The cat eats ...” ⇒ “Die Katze frisst ...” or “Die Katze isst ...”
(German uses “fressen” for animals and “essen” for people, so the right translation depends on the subject noun.)

Page 67

Language Modeling

Corpus texts:
“I have a meeting on Monday evening”
“You should work on Wednesday afternoon”
“The next session of the NLP lecture is on Thursday morning”

“Monday morning” has no observation in the corpus.

⇒ “The talk is on Monday morning.”
  “The talk is on Monday molding.”

[Figure: word clusters {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} and {morning, afternoon, evening, night}]

Page 68

Language Modeling

Corpus texts:
“I have a meeting on Monday evening”
“You should work on Wednesday afternoon”
“The next session of the NLP lecture is on Thursday morning”

⇒ [Week-day] [day-time]

Class-based Language Model

Page 69

Information Retrieval

“The first car was invented by Karl Benz.”

“Thomas Edison invented the first commercially practical light.”

“Alexander Graham Bell invented the first practical telephone.”

[Word cluster: {vehicle, automobile, car}]

Page 70

Information Retrieval

Page 71

Information Retrieval

Clustering documents based on their similarities

Page 72

Clustering Algorithms

Flat:
K-means

Hierarchical:
Top-Down (Divisive)
Bottom-Up (Agglomerative): Single-link, Complete-link, Average-link

Page 73

K-means

The best known clustering algorithm
Works well for many cases
Used as the default / baseline for clustering documents

Algorithm: define each cluster center as the mean or centroid of the items in the cluster

\vec{\mu} = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

Minimizing the average squared Euclidean distance of the items from their cluster centers

Page 74

K-means

Initialization: randomly choose k items as initial centroids
while the stopping criterion has not been met do
    for each item do
        Find the nearest centroid
        Assign the item to the cluster associated with the nearest centroid
    end for
    for each cluster do
        Update the centroid of the cluster as the average of all items in the cluster
    end for
end while

Iterating two steps:
Re-assignment: assigning each vector to its closest centroid
Re-computation: computing each centroid as the average of the vectors that were assigned to it in the re-assignment step
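The same two steps in a short sketch (random toy points; a fixed iteration count stands in for the stopping criterion, and in practice a library implementation such as scikit-learn's KMeans would normally be used):

# Plain k-means sketch: alternate re-assignment and re-computation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 2))                                  # toy items
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]      # initialization

for _ in range(10):                                      # stand-in stopping criterion
    # re-assignment: each vector goes to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # re-computation: each centroid becomes the mean of the vectors assigned to it
    for j in range(k):
        if np.any(assign == j):
            centroids[j] = X[assign == j].mean(axis=0)

print(centroids)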

Page 75

K-means

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 76

Hierarchical Agglomerative Clustering (HAC)

Creating a hierarchy in the form of a binary tree

Page 77

Hierarchical Agglomerative Clustering (HAC)

Initial mapping: put a single item in each cluster
while the predefined number of clusters has not been reached do
    for each pair of clusters do
        Measure the similarity of the two clusters
    end for
    Merge the two clusters that are most similar
end while

Measuring the similarity in three ways:
Single-link
Complete-link
Average-link
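With scikit-learn the three criteria map directly onto the linkage parameter of AgglomerativeClustering (a sketch with invented points, not part of the slides):

# Agglomerative clustering sketch: compare single, complete, and average linkage.
from sklearn.cluster import AgglomerativeClustering

X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]]

for linkage in ("single", "complete", "average"):
    hac = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    print(linkage, hac.fit_predict(X))   # cluster label assigned to each item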

Page 78

Hierarchical Agglomerative Clustering (HAC)

Single-link / single-linkage clustering:
Based on the similarity of the most similar members

Page 79

Hierarchical Agglomerative Clustering (HAC)

Complete-link / complete-linkage clustering:
Based on the similarity of the most dissimilar members

Page 80

Hierarchical Agglomerative Clustering (HAC)

Average-link / average-linkage clustering:
Based on the average of all similarities between the members

Page 81

Hierarchical Agglomerative Clustering (HAC)

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

Page 82

Further Reading

Introduction to Information Retrieval
by C.D. Manning, P. Raghavan, H. Schütze
Cambridge University Press, 2008
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Chapters 13, 14, 15, 16, 17
