
Page 1: Clustering and Probability (Chap 7)

Clustering and Probability (Chap 7)

Page 2: Clustering and Probability (Chap 7)

Review from Last Lecture

Defined the K-means problem for formalizing the notion of clustering.

Discussed the K-means algorithm.

Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features).

Noted the important distinction between “attributes” and “features”.

Page 3: Clustering and Probability (Chap 7)

Example of K-means -1

            Measure 1   Measure 2
Patient 1       1           1
Patient 2       2           1
Patient 3       3           4
Patient 4       4           5

Let initial centroids be C1 = (1,1) and C2 = (2,1)

Page 4: Clustering and Probability (Chap 7)

Example of K-means-2

            Measure 1   Measure 2   Dist to C1 (1,1)   Dist to C2 (2,1)   Nearest
Patient 1       1           1            0                  1               C1
Patient 2       2           1            1                  0               C2
Patient 3       3           4            3.6                3.16            C2
Patient 4       4           5            5                  4.47            C2

Updated centroids: C1 = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3) = (3, 3.33)

Page 5: Clustering and Probability (Chap 7)

Example of K-means-3

            Measure 1   Measure 2   Dist to C1 (1,1)   Dist to C2 (3,3.33)   Nearest
Patient 1       1           1            0                  3.07              C1
Patient 2       2           1            1                  2.54              C1
Patient 3       3           4            3.6                0.67              C2
Patient 4       4           5            5                  1.95              C2

Updated centroids: C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5)

Page 6: Clustering and Probability (Chap 7)

Example of K-means-4

            Measure 1   Measure 2   Dist to C1 (1.5,1)   Dist to C2 (3.5,4.5)   Nearest
Patient 1       1           1            0.5                  4.30               C1
Patient 2       2           1            0.5                  3.81               C1
Patient 3       3           4            3.35                 0.71               C2
Patient 4       4           5            4.72                 0.71               C2

Updated centroids: C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5). The centroids (and cluster assignments) are unchanged, so the algorithm has converged.
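To make the iterations above concrete, here is a minimal K-means sketch in Python applied to the four patients. This is an illustration only, not code from the chapter; the helper name `kmeans`, the convergence test, and the use of Euclidean distance via `math.dist` are our assumptions.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Minimal K-means sketch: repeatedly assign points to the nearest
    centroid, then move each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members
        # (the sketch assumes every centroid keeps at least one member)
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(tuple(sum(coord) / len(members)
                                       for coord in zip(*members)))
        if new_centroids == centroids:   # centroids stopped moving: converged
            break
        centroids = new_centroids
    return centroids, labels

# The four patients and the initial centroids used in the slides
patients = [(1, 1), (2, 1), (3, 4), (4, 5)]
centroids, labels = kmeans(patients, [(1, 1), (2, 1)])
print(centroids)   # [(1.5, 1.0), (3.5, 4.5)]
print(labels)      # [0, 0, 1, 1] -> {Patient 1, 2} and {Patient 3, 4}
```

The printed centroids and assignments match the final table above.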

Page 7: Clustering and Probability (Chap 7)

Example: 2 Clusters

[Figure: four points A(-1,2), B(1,2), C(-1,-2), D(1,-2), placed symmetrically about the origin (0,0).]

K-means Problem: the optimal solution is the centroids (0,2) and (0,-2), with clusters {A,B} and {C,D}.

K-means Algorithm: suppose the initial centroids are (-1,0) and (1,0); then {A,C} and {B,D} end up as the two clusters (see the sketch below).

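Reusing the `kmeans` sketch above (again, only an illustration), starting from the centroids (-1, 0) and (1, 0) shows how a poor initialization locks in the suboptimal split, even though the optimal solution is (0, 2) and (0, -2):

```python
points = [(-1, 2), (1, 2), (-1, -2), (1, -2)]        # A, B, C, D
centroids, labels = kmeans(points, [(-1, 0), (1, 0)])
print(centroids)   # [(-1, 0), (1, 0)]: the centroids never move
print(labels)      # [0, 1, 0, 1] -> clusters {A, C} and {B, D}
```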

Page 8: Clustering and Probability (Chap 7)

Several other issues regarding clustering

How do you select the initial centroids?

How do you select the right number of clusters?

How do you deal with non-Euclidean distance/similarity measures?

Other approaches (e.g., hierarchical, spectral clustering).

The curse of high dimensionality.

Page 9: Clustering and Probability (Chap 7)

Question

S-Length   S-Width   P-Length   P-Width   Flower
Small      Medium    Small      Medium    A (SetosA)
Medium     Medium    Large      Large     O (Versicolor)
Medium     Small     Small      Large     I (Virginica)
Large      Large     Medium     Small     A
Large      Small     Medium     Small     ?

What should the “prediction” be for this flower?

Page 10: Clustering and Probability (Chap 7)

Prediction and Probability

When we make predictions we should attach “probabilities” to the prediction.

Examples: a 20% chance it will rain tomorrow; a 50% chance that the tumor is malignant; a 60% chance that the stock market will fall by the end of the week; a 30% chance that the next president of the United States will be a Democrat; a 0.1% chance that the user will click on a banner ad.

How do we assign probabilities to complex events? Using smart data algorithms... and counting.

Page 11: Clustering and Probability (Chap 7)

Probability Basics

Probability is a deep topic, but in most cases the rules are straightforward to apply.

Terminology: experiment, sample space, events, probability, rules of probability, conditional probability, Bayes rule.

Page 12: Clustering and Probability (Chap 7)

Probability: Sample Space

Consider an experiment and let S be the space of possible outcomes.

Example: Experiment is tossing a coin; S={h,t}

Experiment is rolling a pair of dice: S={(1,1),(1,2),…(6,6)}

Experiment is a race consisting of three cars: 1,2 and 3. The sample space is {(1,2,3),(1,3,2),(2,1,3),(2,3,1),(3,1,2),(3,2,1)}

Page 13: Clustering and Probability (Chap 7)

Probabilities

Let Sample Space S = {1,2,…m}

Consider numbers p_1, p_2, ..., p_m with each p_i ≥ 0 and p_1 + p_2 + ... + p_m = 1.

p_i is the probability that the outcome of the experiment is i.

Suppose we toss a fair coin. The sample space is S = {h,t}. Then p_h = 0.5 and p_t = 0.5.

Page 14: Clustering and Probability (Chap 7)

Probability

Experiment: will it rain or not in Sydney? S = {rain, no-rain}. P_rain = 138/365 = 0.38; P_no-rain = 227/365 = 0.62.

Assigning probabilities (or rather, how to assign them) is a deep philosophical problem. What is the probability that “the green object standing outside my house is a burglar dressed in green”?

Page 15: Clustering and Probability (Chap 7)

Probability

An Event A is a set of possible outcomes of the experiment. Thus A is a subset of S.

Let A be the event of getting a seven when we roll a pair of dice. A = {(1,6),(6,1),(2,5),(5,2),(4,3),(3,4) }

P(A) = 6/36 = 1/6

In general, P(A) is the sum of p_i over the outcomes i in A; when all outcomes are equally likely, P(A) = |A|/|S|.

Page 16: Clustering and Probability (Chap 7)

Probability

The sample space S and events are “sets”.

P(S) = 1;

P(Φ) = 0

Addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Often A and B are disjoint (A ∩ B = Φ), in which case P(A ∪ B) = P(A) + P(B).

Complement: P(A^c) = 1 - P(A), where A^c is the set of outcomes not in A.

Page 17: Clustering and Probability (Chap 7)

Example

Suppose the probability of rain today is 0.4, the probability of rain tomorrow is also 0.4, and the probability of rain on both days is 0.1. What is the probability that it does not rain on either day?

S = {(R,R), (R,N), (N,R), (N,N)}. Let A be the event that it rains today and B the event that it rains tomorrow. Then A = {(R,N), (R,R)}; B = {(N,R), (R,R)}.

Rain today or tomorrow: P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.4 + 0.4 - 0.1 = 0.7. No rain on either day: 1 - 0.7 = 0.3.

Page 18: Clustering and Probability (Chap 7)

Conditional Probability

One of the most important concepts in all of Data Mining and Machine Learning

P(A|B) = P(AB)/P(B), assuming P(B) ≠ 0. This is the conditional probability of A given that B has occurred.

Probability it will rain tomorrow given that it has rained today (with A and B as in the previous example): P(B|A) = P(AB)/P(A) = 0.1/0.4 = 1/4 = 0.25. In general, P(A|B) is not equal to P(B|A).

Page 19: Clustering and Probability (Chap 7)

We need conditional probability to answer….

S-Length   S-Width   P-Length   P-Width   Flower
Small      Medium    Small      Medium    A (SetosA)
Medium     Medium    Large      Large     O (Versicolor)
Medium     Small     Small      Large     I (Virginica)
Large      Large     Medium     Small     A
Large      Small     Medium     Small     ?

What should the “prediction” be for this flower?

Page 20: Clustering and Probability (Chap 7)

Bayes Rule

P(A|B) = P(AB)/P(B); P(B|A) = P(BA)/P(A)

Now P(AB) = P(BA)

Thus P(A|B)P(B) = P(B|A)P(A)

Thus P(A|B) = [P(B|A) P(A)] / P(B). This is called Bayes Rule. It is the basis of almost all prediction. Recent theories hypothesize that human memory and action are Bayes rule in action.

Page 21: Clustering and Probability (Chap 7)

Bayes Rule

[Figure: Bayes rule P(A|B) = P(B|A) P(A) / P(B), with P(A) labelled the prior and P(A|B) the posterior.]

Page 22: Clustering and Probability (Chap 7)

Bayes Rule: Example

The ASX market goes up on 60% of the days of a year; 40% of the time it stays the same or goes down. On a day when the ASX is up, there is a 50% chance that the Shanghai index is up; on other days there is a 30% chance that Shanghai goes up. Suppose the Shanghai market is up. What is the probability that the ASX was up?

Define A1 as “ASX is up” and A2 as “ASX is not up”. Define S1 as “Shanghai is up” and S2 as “Shanghai is not up”.

We want to calculate P(A1|S1).

P(A1) = 0.6; P(A2) = 0.4; P(S1|A1) = 0.5; P(S1|A2) = 0.3

P(S2|A1) = 1 - P(S1|A1) = 0.5; P(S2|A2) = 1 - P(S1|A2) = 0.7.

Page 23: Clustering and Probability (Chap 7)

Bayes Rule: Example

We want to calculate P(A1|S1). P(A1) = 0.6; P(A2) = 0.4; P(S1|A1) = 0.5; P(S1|A2) = 0.3; P(S2|A1) = 1 - P(S1|A1) = 0.5; P(S2|A2) = 1 - P(S1|A2) = 0.7.

P(A1|S1) = P(S1|A1)P(A1)/(P(S1))

How do we calculate P(S1)?

Page 24: Clustering and Probability (Chap 7)

Bayes Rule: Example

P(S1) = P(S1,A1) + P(S1,A2)   [Key Step]
      = P(S1|A1)P(A1) + P(S1|A2)P(A2)
      = 0.5 x 0.6 + 0.3 x 0.4 = 0.42

Finally,

P(A1|S1) = P(S1|A1)P(A1)/P(S1) = (0.5 x 0.6)/0.42 = 0.71
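A few lines of Python (an illustrative sketch; the variable names are ours, not from the slides) reproduce the calculation, including the total-probability step for P(S1):

```python
p_a1, p_a2 = 0.6, 0.4          # P(A1) = P(ASX up), P(A2) = P(ASX not up)
p_s1_a1, p_s1_a2 = 0.5, 0.3    # P(S1|A1), P(S1|A2): Shanghai up given ASX up / not up

# Total probability (the key step): P(S1) = P(S1|A1)P(A1) + P(S1|A2)P(A2)
p_s1 = p_s1_a1 * p_a1 + p_s1_a2 * p_a2

# Bayes rule: P(A1|S1) = P(S1|A1)P(A1) / P(S1)
p_a1_s1 = p_s1_a1 * p_a1 / p_s1

print(round(p_s1, 2), round(p_a1_s1, 2))   # 0.42 0.71
```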

Page 25: Clustering and Probability (Chap 7)

Example: Iris Flower

F = Flower; SL = Sepal Length; SW = Sepal Width; PL = Petal Length; PW = Petal Width. Data = (SL = Large, SW = Small, PL = Medium, PW = Small), the attributes of the unlabelled flower.

By Bayes rule, P(F|Data) = P(Data|F) P(F) / P(Data). Compute this for each flower class (A, O, I) and choose the maximum.

Page 26: Clustering and Probability (Chap 7)

Example: Iris Flower

So how do we compute P(Data|F=A)?

This is a non-trivial question [subject to much research].

One approach is to count how many times “Data” appears in the “database” when F = A.

In this case “Data” is a 4-dimensional “data vector”, and each component takes 3 values (small, medium, large). Thus the number of possible combinations is 3^4 = 81, far too many to estimate by direct counting from a small table.

Page 27: Clustering and Probability (Chap 7)

Example: Iris Flower

Conditional Independence

P(Data|F=A) = P(SL=Large, SW=Small, PL=Medium, PW=Small | F=A)
            ≈ P(SL=Large|F=A) x P(SW=Small|F=A) x P(PL=Medium|F=A) x P(PW=Small|F=A)

The above is an assumption made to keep the computation easy. Surprisingly, evidence suggests that it works reasonably well in practice.

This prediction method (which exploits conditional independence) is called “Naïve Bayes Classifier.”
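As a concrete illustration, here is a minimal Naïve Bayes sketch in Python applied to the four-row flower table from the slides. The code is ours, not the chapter's; in particular it adds add-one (Laplace) smoothing, which the slides do not mention, so that counts on such a tiny table do not collapse to zero.

```python
from collections import Counter

# Training rows from the slides: (SL, SW, PL, PW, Flower)
data = [
    ("Small",  "Medium", "Small",  "Medium", "A"),   # SetosA
    ("Medium", "Medium", "Large",  "Large",  "O"),   # Versicolor
    ("Medium", "Small",  "Small",  "Large",  "I"),   # Virginica
    ("Large",  "Large",  "Medium", "Small",  "A"),
]
query = ("Large", "Small", "Medium", "Small")  # the flower to classify
VALUES_PER_ATTRIBUTE = 3                       # Small / Medium / Large

class_counts = Counter(row[-1] for row in data)
scores = {}
for flower, n in class_counts.items():
    rows = [row for row in data if row[-1] == flower]
    score = n / len(data)                      # prior P(F = flower)
    for i, value in enumerate(query):
        matches = sum(row[i] == value for row in rows)
        # Conditional independence: multiply P(attribute_i = value | F = flower);
        # add-one (Laplace) smoothing keeps tiny counts from becoming zero.
        score *= (matches + 1) / (n + VALUES_PER_ATTRIBUTE)
    scores[flower] = score

print(scores)                         # un-normalised P(Data|F) * P(F) per class
print(max(scores, key=scores.get))    # predict the class with the maximum score
```

On such a small table the estimates are crude and depend on the smoothing assumption, but the mechanics (count, multiply the conditionals by the prior, choose the maximum) are exactly those described above.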