Example: Text Classification
Task: Given an article, predict its category
Categories: sports, entertainment, news, weather, ...; spam/not spam
What kind of information is useful for this task?
Classification Task
Task: Given x, determine its category y in C, where C is a finite set of labels (aka categories, classes)
Instance: (x, y)
  x: thing to be labeled/classified
  y: label/class
Data: set of instances
  Labeled data: y is known
  Unlabeled data: y is unknown
Training data, test data
Text Classification Examples
Spam filtering
Call routing
Sentiment classification: positive/negative, or a score from 1 to 5
POS Tagging
Task: Given a sentence, predict the tag of each word
Is this a classification problem?
Categories: N, V, Adj, ...
What information is useful?
How do POS tagging and text classification differ? POS tagging is a sequence labeling problem.
Word Segmentation
Task: Given a string, break it into words
Categories:
  B(reak), NB (no break)
  B(eginning), I(nside), E(nd)
e.g. c1 c2 || c3 c4 c5
  c1/NB c2/B c3/NB c4/NB c5/B
  c1/B c2/E c3/B c4/I c5/E
What type of task? Also sequence labeling.
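The B/I/E labeling above can be sketched in code. This is an illustrative helper, not from the slides; it labels single-character words B (some schemes use a separate S tag instead):

```python
# Sketch (illustrative): derive per-character B(eginning)/I(nside)/E(nd)
# labels from a list of already-segmented words.
def bie_labels(words):
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("B")  # convention choice; some schemes use S
        else:
            labels.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return labels
```

For the slide's example with words "c1 c2" and "c3 c4 c5" (each ci one character), this yields B E B I E, matching c1/B c2/E c3/B c4/I c5/E.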
Two Stages
Training:
  Learner: training data → classifier
Testing:
  Decoder: test data + classifier → classification output
Also: preprocessing, postprocessing, evaluation
Representing Input
Potentially infinite values to represent
Represent input as a feature vector:
  x = <v1, v2, v3, ..., vn>
  x = <f1=v1, f2=v2, ..., fn=vn>
What are good features?
Doc1
Western Union Money Transfer
[email protected]
Bishops Square Akpakpa E1 6AO, Cotonou
Benin Republic
Website: http://www.westernunion.com/info/selectCountry.asP
Phone: +229 99388639
Attention Beneficiary,
This to inform you that the federal ministry of finance Benin Republic has started releasing scam victim compensation fund mandated by United Nation Organization through our office.
I am contacting you because our agent have sent you the first payment of $5,000 for your compensation funds total amount of $500 000 USD (Five hundred thousand united state dollar)
We need your urgent response so that we shall release your payment information to you.
You can call our office hot line for urgent attention(+22999388639)
Doc2
Hello! my dear. How are you today and your family? I hope all is good, kindly pay Attention and understand my aim of communicating you today through this Letter, My names is Saif al-Islam al-Gaddafi the Son of former Libyan President. i was born on 1972 in Tripoli Libya, By Gaddafi's second wive. I want you to help me clear this fund in your name which i deposited in Europe please i would like this money to be transferred into your account before they find it. the amount is 20.300,000 million GBP British Pounds sterling through a
Doc4
from: [email protected]
REMINDER:
If you have not received a PIN number to vote in the elections and have not already contacted us, please contact either Drago Radev ([email protected]) or Priscilla Rasmussen ([email protected]) right away.
Everyone who has not received a pin but who has contacted us already will get a new pin over the weekend.
Anyone who still wants to join for 2011 needs to do this by Monday (November 7th) in order to be eligible to vote.
And, if you do have your PIN number and have not voted yet, remember every vote counts!
Possible Features
Words!
  Feature for each word
    Binary: presence/absence
    Integer: occurrence count
  Particular word types (money/sex/...): [Vv].*gr.*
Errors: spelling, grammar
Images
Header info
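The word features above can be sketched as a small extractor. This is an illustrative sketch (feature names like `has_`/`count_`/`type_Vgr` are my own): binary presence, integer counts, and a regex "word type" feature using the [Vv].*gr.* pattern from the slide:

```python
import re

# Sketch (illustrative names): extract word features from a document.
def extract_features(text):
    words = text.lower().split()
    feats = {}
    for w in words:
        feats["has_" + w] = 1                                  # binary presence
        feats["count_" + w] = feats.get("count_" + w, 0) + 1   # occurrence count
    # Word-type feature: does any token match the [Vv].*gr.* pattern?
    feats["type_Vgr"] = int(any(re.match(r"[Vv].*gr.*", w) for w in text.split()))
    return feats
```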
Representing Input: Attribute-Value Matrix

            f1         f2        ...    fm
            Currency   Country          Date   Label
  x1=Doc1   1          1         0.3    0      Spam
  x2=Doc2   1          1         1.75   1      Spam
  ...
  xn=Doc4   0          0         0      2      NotSpam
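An AVM like the one above can be assembled from per-document feature dicts. A minimal sketch (feature names are illustrative), one row per instance and one column per feature:

```python
# Sketch: build an attribute-value matrix from (feature-dict, label) pairs.
def build_avm(docs):
    # Union of all feature names, in a fixed (sorted) column order.
    feature_names = sorted({f for feats, _ in docs for f in feats})
    matrix, labels = [], []
    for feats, label in docs:
        matrix.append([feats.get(f, 0) for f in feature_names])  # 0 if absent
        labels.append(label)
    return feature_names, matrix, labels
```

This is essentially what vectorizers like scikit-learn's DictVectorizer do, except they return a sparse matrix.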
Classifier
Result of training on the input data (with or without class labels)
Formal perspective: f(x) = y, where x is the input and y is in C
More generally: f(x) = {(ci, scorei)}, where x is the input, ci is in C, and scorei is the score for that category assignment
Testing
Input:
  Test data (e.g., an AVM)
  Classifier
Output: decision matrix
Can assign the highest-scoring class to each input

        x1    x2    x3   ...
  c1    0.1   0.1   0.2  ...
  c2    0     0.8   0    ...
  c3    0.2   0     0.7  ...
  c4    0.7   0.1   0.1  ...

  x1    x2    x3
  c4    c2    c3
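Picking the highest-scoring class per input from such a decision matrix is a per-column argmax. A sketch, representing the matrix as class → list of scores (layout is my choice):

```python
# Sketch: assign each input the class with the highest score in the
# decision matrix, stored as {class: [score for x1, x2, ...]}.
def assign_classes(decision_matrix, n_inputs):
    assignments = []
    for i in range(n_inputs):
        best = max(decision_matrix, key=lambda c: decision_matrix[c][i])
        assignments.append(best)
    return assignments

# Scores from the slide's decision matrix.
scores = {
    "c1": [0.1, 0.1, 0.2],
    "c2": [0.0, 0.8, 0.0],
    "c3": [0.2, 0.0, 0.7],
    "c4": [0.7, 0.1, 0.1],
}
```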
Evaluation
Confusion matrix:

                 Gold
  System         +     -
    +            TP    FP
    -            FN    TN

Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F-score: 2PR/(P+R)
Accuracy: (TP+TN)/(TP+TN+FP+FN)
Why F-score rather than accuracy?
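The four formulas above, written as functions:

```python
# The evaluation metrics from the slide, directly as functions.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```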
Evaluation Example
Confusion matrix:

                 Gold
  System         +     -
    +            1     4
    -            5     90

Precision: 1/(1+4) = 1/5
Recall: 1/(1+5) = 1/6
F-score: 2 * (1/5) * (1/6) / (1/5 + 1/6) = 2/11
Accuracy: (1+90)/100 = 91%
Classification Problem Steps
Input processing:
  Split data into training/dev/test
  Convert data into an attribute-value matrix:
    Identify candidate features
    Perform feature selection
    Create the AVM representation
Training
Testing
Evaluation
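The first step, splitting data into training/dev/test, can be sketched as below. The 80/10/10 ratio and the fixed seed are illustrative choices, not from the slides:

```python
import random

# Sketch: shuffled 80/10/10 train/dev/test split (ratio is a choice).
def split_data(instances, seed=0):
    data = list(instances)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])
```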
Classification Algorithms
Will be covered in detail in 572
Nearest Neighbor
Naïve Bayes
Decision Trees
Neural Networks
Maximum Entropy
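To give a taste ahead of 572, the first algorithm in the list, nearest neighbor, is small enough to sketch: the "learner" just stores the training instances, and the "decoder" labels a new vector with the label of its closest stored vector (Euclidean distance here; the distance metric is a choice):

```python
import math

# Sketch of 1-nearest-neighbor classification.
def nn_train(instances):
    # instances: list of (feature_vector, label); training = memorization.
    return list(instances)

def nn_classify(model, x):
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, x)))
    vec, label = min(model, key=lambda inst: dist(inst[0]))
    return label
```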
Feature Design & Representation
What sorts of information do we want to encode? Words, frequencies, ngrams, morphology, sentence length, etc.
Issue: learning algorithms work on numbers
  Many work only on binary values (0/1)
  Others work on any real-valued input
How can we represent different kinds of information numerically? In binary?
Representation
Words/tags/ngrams/etc.
  One feature per item:
    Binary: presence/absence
    Real: counts
Binarizing numeric features:
  Single threshold
  Multiple thresholds
  Binning: one binary feature per bin
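Binning, the last option above, can be sketched as follows. The bin edges and feature names are illustrative:

```python
# Sketch: binarize a numeric feature into one 0/1 feature per bin,
# with bins defined by consecutive edges [lo, hi).
def bin_feature(value, edges):
    feats = {}
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        feats[f"bin_{lo}_{hi}"] = int(lo <= value < hi)
    return feats
```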
Feature Template
Example: Prevword (or w-1)
A template corresponds to many features, e.g. for "time flies like an arrow":
  w-1=<s>
  w-1=time
  w-1=flies
  w-1=like
  w-1=an
  ...
Each is shorthand for a binary feature: w-1=<s> is 0 or 1, w-1=time is 0 or 1, ...
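Instantiating the Prevword template over a sentence can be sketched as below; each position gets the one w-1 feature that fires there (the sparse-dict representation is my choice):

```python
# Sketch: one w-1 (previous-word) feature per token position,
# represented sparsely as {feature_name: 1}.
def prevword_features(words):
    padded = ["<s>"] + list(words)   # sentence-start marker
    return [{"w-1=" + padded[i]: 1} for i in range(len(words))]
```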
AVM Example: "Time flies like an arrow"
Note: this is a compact form of the true sparse vector (one feature w-1=w, valued 0 or 1, for each w in V)

        w-1     w0      w-1w0         w+1     label
  x1    <s>     Time    <s> Time      flies   N
  x2    Time    flies   Time flies    like    V
  x3    flies   like    flies like    an      P
Example: NER
Named Entity tagging:
  John visited New York last Friday →
  [person John] visited [location New York] [time last Friday]
As a classification problem:
  John/PER-B visited/O New/LOC-B York/LOC-I last/TIME-B Friday/TIME-I
Input?
Categories?
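Going back from the per-token tags to bracketed entities is a small decoding step. A sketch, assuming the tag format shown (TYPE-B / TYPE-I, with O for non-entity tokens):

```python
# Sketch: recover (type, span) pairs from per-token TYPE-B/TYPE-I/O tags.
def tags_to_spans(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                spans.append(current)
                current = None
        else:
            etype, pos = tag.rsplit("-", 1)
            # Start a new span on B, or on an I whose type doesn't continue
            # the current span.
            if pos == "B" or current is None or current[0] != etype:
                if current:
                    spans.append(current)
                current = (etype, [tok])
            else:
                current[1].append(tok)
    if current:
        spans.append(current)
    return [(t, " ".join(ws)) for t, ws in spans]
```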
Example: Coreference
"Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment..."
Can be viewed as a classification problem
What are the inputs?
What are the categories?
What features would be useful?