
Classification II

Tamara Berg

CS 560 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan


Announcements

• HW3 due tomorrow, 11:59pm

• Midterm 2 next Wednesday, Nov 4
  – Bring a simple calculator
  – You may bring one 3x5 notecard of notes (both sides)

• Monday, Nov 2 we will have in-class practice questions


Midterm Topic List

Probability
– Random variables
– Axioms of probability
– Joint, marginal, conditional probability distributions
– Independence and conditional independence
– Product rule, chain rule, Bayes rule

Bayesian Networks (general)
– Structure and parameters
– Calculating joint and conditional probabilities
– Independence in Bayes Nets (Bayes Ball)

Bayesian Inference
– Exact inference (Inference by Enumeration, Variable Elimination)
– Approximate inference (Forward Sampling, Rejection Sampling, Likelihood Weighting)
– Networks for which efficient inference is possible


Midterm Topic List

Naïve Bayes
– Parameter learning, including Laplace smoothing
– Likelihood, prior, posterior
– Maximum likelihood (ML), maximum a posteriori (MAP) inference
– Application to spam/ham classification and image classification

HMMs
– Markov property
– Markov chains
– Hidden Markov Model (initial distribution, transitions, emissions)
– Filtering (forward algorithm)
– Application to speech recognition and robot localization


Midterm Topic List

Machine Learning
– Unsupervised/supervised/semi-supervised learning
– K-means clustering
– Hierarchical clustering (agglomerative, divisive)
– Training, tuning, testing, generalization
– Nearest Neighbor
– Decision Trees
– Boosting
– Application of algorithms to research problems (e.g. visual word discovery, pose estimation, im2gps, scene completion, face detection)


The basic classification framework

y = f(x)

(x: input; f: classification function; y: output)

• Learning: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the parameters of the prediction function f

• Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)
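To make the learning/inference split concrete, here is a minimal sketch of this framework as a train/predict interface; the class and method names are illustrative, not part of the course code.

```python
# Minimal sketch of the y = f(x) framework as a train/predict interface (illustrative names).
class Classifier:
    def fit(self, X, y):
        """Learning: estimate the parameters of f from labeled examples {(x1,y1), ..., (xN,yN)}."""
        raise NotImplementedError

    def predict(self, x):
        """Inference: apply f to a never-before-seen example x and return the predicted y = f(x)."""
        raise NotImplementedError
```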


Classification by Nearest Neighbor

Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?



Classification by Nearest Neighbor

Classify the test document as the class of the document “nearest” to the query document (use vector similarity, e.g. Euclidean distance, to find most similar doc)


Classification by kNN

Classify the test document as the majority class of the k documents “nearest” to the query document.
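As a rough sketch (assuming documents are already encoded as fixed-length word-count vectors; the function and variable names are illustrative), kNN classification can be written as:

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority class among the k training documents nearest to the query vector."""
    dists = np.linalg.norm(train_vectors - query, axis=1)  # Euclidean distance to every training doc
    nearest = np.argsort(dists)[:k]                        # indices of the k closest documents
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to the plain nearest-neighbor rule from the previous slide.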


Decision tree classifier

Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:

1. Alternate: is there an alternative restaurant nearby?

2. Bar: is there a comfortable bar area to wait in?

3. Fri/Sat: is today Friday or Saturday?

4. Hungry: are we hungry?

5. Patrons: number of people in the restaurant (None, Some, Full)

6. Price: price range ($, $$, $$$)

7. Raining: is it raining outside?

8. Reservation: have we made a reservation?

9. Type: kind of restaurant (French, Italian, Thai, Burger)

10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)


Decision tree classifier



Shall I play tennis today?


How do we choose the best attribute?

(Figure: a partially grown tree. Some branches end in leaf nodes; for the rest we must choose the next attribute for splitting.)


Criterion for attribute selection

• Which is the best attribute?
  – The one which will result in the smallest tree
  – Heuristic: choose the attribute that produces the “purest” nodes

• Need a good measure of purity!


Information Gain

Which test is more informative?

Splitting on Humidity (≤ 75% vs. > 75%) or splitting on Wind (≤ 20 vs. > 20)?


Information Gain

Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples


Impurity

(Figure: three example groups, from a very impure group, to a less impure group, to a group with minimum impurity.)


Entropy: a common way to measure impurity

• Entropy = -Σ_i p_i log2(p_i)

  where p_i is the probability of class i, computed as the proportion of class i in the set.
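A small sketch of this formula in Python (assuming the group is given as a list of class labels):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a group: sum over classes of -p_i * log2(p_i), with p_i the class proportion."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # 0.0 : all examples in one class (minimum impurity)
print(entropy(["+"] * 5 + ["-"] * 5))   # 1.0 : a 50/50 split (maximum impurity)
```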


2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class?
  entropy = -1 log2(1) = 0  (minimum impurity)

• What is the entropy of a group with 50% in either class?
  entropy = -0.5 log2(0.5) - 0.5 log2(0.5) = 1  (maximum impurity)


Information Gain

• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how useful a given attribute of the feature vectors is.

• We can use it to decide the ordering of attributes in the nodes of a decision tree.


Calculating Information Gain

Information Gain = entropy(parent) - [weighted average entropy(children)]

Entire population (30 instances: 14 of one class, 16 of the other), split into one child with 17 instances and one child with 13 instances.

Parent entropy = -(14/30) log2(14/30) - (16/30) log2(16/30) ≈ 0.996

Child 1 entropy (17 instances: 13 and 4) = -(13/17) log2(13/17) - (4/17) log2(4/17) ≈ 0.787
Child 2 entropy (13 instances: 1 and 12) = -(1/13) log2(1/13) - (12/13) log2(12/13) ≈ 0.391

(Weighted) average entropy of children = (17/30)(0.787) + (13/30)(0.391) ≈ 0.615

Information Gain = 0.996 - 0.615 = 0.38
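The same arithmetic as a quick check in Python (class counts taken from the example above):

```python
import math

def group_entropy(counts):
    """Entropy of a group given its per-class counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

parent  = group_entropy([14, 16])                  # ≈ 0.996  (30 instances)
child1  = group_entropy([13, 4])                   # ≈ 0.787  (17 instances)
child2  = group_entropy([1, 12])                   # ≈ 0.391  (13 instances)
average = (17 / 30) * child1 + (13 / 30) * child2  # ≈ 0.615
gain    = parent - average                         # ≈ 0.38
```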


Choose the next attribute to split on, e.g. based on information gain.
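Putting the pieces together, here is a compact sketch of greedy, ID3-style tree construction that picks the attribute with the highest information gain at each node; it assumes categorical attributes stored in dicts, and the names are illustrative rather than the course's reference code.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Parent entropy minus the weighted average entropy of the children of a split on attr."""
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        child = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(child) / len(labels) * entropy(child)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """rows: list of dicts mapping attribute -> value; labels: list of class labels."""
    if len(set(labels)) == 1 or not attributes:        # pure node, or no attributes left to split on
        return Counter(labels).most_common(1)[0][0]    # leaf: predict the majority class
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != best])
    return (best, branches)
```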


Model Ensembles


Random Forests

A variant of bagging proposed by Breiman.

The classifier consists of a collection of decision-tree-structured classifiers.

Each tree casts a vote for the class of the input x.
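A minimal sketch of the bagging-plus-voting idea, using scikit-learn's DecisionTreeClassifier as the base tree (assuming X and y are NumPy arrays; the parameters and function names are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25):
    """Train each tree on a bootstrap sample, with a random feature subset considered at each split."""
    trees = []
    for _ in range(n_trees):
        idx = np.random.randint(0, len(X), size=len(X))     # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features="sqrt")  # random subset of features per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, x):
    """Each tree casts a vote for the class of input x; return the majority class."""
    votes = Counter(tree.predict(x.reshape(1, -1))[0] for tree in trees)
    return votes.most_common(1)[0][0]
```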


Boosting

• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998

• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003

• Easy to implement; doesn’t require external optimization tools. Used for many real problems in AI.


Boosting

• Defines a classifier using an additive model:

  F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …

  F(x) is the strong classifier, each f_t(x) is a weak classifier, each α_t is its weight, and x is the input feature vector.


Boosting

• Defines a classifier using an additive model:

  F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …

  (strong classifier = weighted combination of weak classifiers applied to the input feature vector x)

• We need to define a family of weak classifiers; each f_t(x) is selected from this family.


Adaboost

Input: training samples
Initialize weights on samples

For T iterations:
  Select best weak classifier based on weighted error
  Update sample weights

Output: final strong classifier (combination of selected weak classifier predictions)
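A condensed sketch of this loop with decision stumps as the weak classifiers (labels assumed to be NumPy arrays in {-1, +1}; the exhaustive stump search and all names are illustrative, not the slides' code):

```python
import numpy as np

def stump_predict(stump, X):
    """Axis-aligned threshold classifier: sign * (+1 if x[feat] > thresh else -1)."""
    feat, thresh, sign = stump
    return sign * np.where(X[:, feat] > thresh, 1, -1)

def best_stump(X, y, w):
    """Pick the stump with the lowest weighted error (exhaustive search, for clarity)."""
    best = None
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (+1, -1):
                pred = stump_predict((feat, thresh, sign), X)
                err = w[pred != y].sum()
                if best is None or err < best[2]:
                    best = ((feat, thresh, sign), pred, err)
    return best

def adaboost(X, y, T=50):
    """Returns a list of (alpha, stump) pairs forming the strong classifier."""
    w = np.ones(len(y)) / len(y)                # initialize sample weights uniformly
    ensemble = []
    for _ in range(T):
        stump, pred, err = best_stump(X, y, w)  # select best weak classifier by weighted error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)          # update sample weights: boost the mistakes
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def strong_predict(ensemble, X):
    """Strong classifier = sign of the weighted sum of weak-classifier predictions."""
    return np.sign(sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble))
```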


Boosting

• It is a sequential procedure.

Each data point x_t has a class label y_t = +1 or -1, and a weight, initially w_t = 1.


Toy example

Weak learners from the family of lines.

h => p(error) = 0.5: it is at chance.

Each data point x_t has a class label y_t = +1 or -1, and a weight w_t = 1.


Toy example

This one seems to be the best.

Each data point x_t has a class label y_t = +1 or -1, and a weight w_t = 1.

This is a ‘weak classifier’: it performs slightly better than chance.


Toy example

Each data point has a class label y_t = +1 or -1.

We update the weights: w_t ← w_t exp{-y_t H_t}


Toy example

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.


Adaboost

Input: training samples
Initialize weights on samples

For T iterations:

Select best weak classifier based on weighted error

Update sample weights

Output: final strong classifier (combination of selected weak classifier predictions)


Boosting for Face Detection


Face detection

• We slide a window over the image
• Extract features for each window
• Classify each window into face / non-face

Pipeline: window x → features → classifier F(x) → y, where y = +1 (face) or -1 (not face)
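A rough sketch of that pipeline (here extract_features and classify are stand-ins for the rectangle-feature extractor and the boosted classifier described next; the window size and stride are illustrative):

```python
import numpy as np

def detect_faces(image, extract_features, classify, window=24, stride=4):
    """Slide a fixed-size window over a grayscale image; return corners classified as +1 (face)."""
    detections = []
    H, W = image.shape
    for r in range(0, H - window + 1, stride):
        for c in range(0, W - window + 1, stride):
            patch = image[r:r + window, c:c + window]
            x = extract_features(patch)      # feature vector for this window
            if classify(x) == +1:            # +1 face, -1 not face
                detections.append((r, c))
    return detections
```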


What is a face?

• Eyes are dark (eyebrows + shadows)
• Cheeks and forehead are bright
• Nose is bright

Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04


Basic feature extraction

• Information type: intensity
• Sum over: gray and white rectangles
• Output: gray - white
• Separate output value for
  – each type
  – each scale
  – each position in the window
• FEX(im) = x = [x1, x2, ……, xn]

Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04

(Figure: example features x120, x357, x629, x834 overlaid on a face window.)
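A sketch of one such rectangle feature, computed with an integral image so any rectangle sum costs four lookups (the two-rectangle layout and coordinates are illustrative):

```python
import numpy as np

def integral_image(im):
    """ii[r, c] = sum of im[:r, :c]; padding a zero row/column makes rectangle sums four lookups."""
    return np.pad(im.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in im[r0:r1, c0:c1] using the padded integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r, c, h, w):
    """Gray (top) rectangle sum minus white (bottom) rectangle sum of equal size."""
    gray = rect_sum(ii, r, c, r + h, c + w)
    white = rect_sum(ii, r + h, c, r + 2 * h, c + w)
    return gray - white
```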


Decision trees

• Stump:
  – 1 root
  – 2 leaves
• If xi > a then positive, else negative
• Very simple
• A “weak classifier”

Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04


Summary: Face detection

• Use decision stumps as weak classifiers

• Use boosting to build a strong classifier

• Use a sliding window to detect the face

Example stump: is x234 > 1.3?  Yes → +1 (face);  No → -1 (non-face)


Discriminant Function

• It can be an arbitrary function of x, such as:
  – Nearest Neighbor
  – Decision Tree
  – Linear Functions