School of Information University of Michigan Discrete and continuous distributions

Discrete and continuous distributions

Page 1: Discrete and continuous distributions

School of Information, University of Michigan

Discrete and continuous distributions

Page 2: Discrete and continuous distributions

Where does the binomial coefficient come from?

Suppose I have 7 blue and pink balls, each of them uniquely marked (A, B, C, D, E, F, G) so that I can distinguish them

How many different samples can I draw containing the same balls but in a different order? 7!

I have 7 choices for the first spot, 6 choices for the second (since I’ve picked 1 and now have only 6 to choose from), 5 choices for the third, etc.

7! = 7 * 6 * 5 * 4 * 3 * 2 * 1 = 5040
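A quick check of this count in R (a minimal sketch; the variable names n_balls and orderings are only illustrative): multiplying the shrinking number of choices gives the same value as the built-in factorial().

```r
n_balls <- 7
# 7 choices for the first spot, 6 for the second, and so on down to 1
orderings <- prod(n_balls:1)
orderings
factorial(n_balls)   # the built-in gives the same count
```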

Page 3: Discrete and continuous distributions

Now if I am just counting the number of blue and pink balls, I don’t care about the order. So all possible arrangements (3! = 6) of the pink balls (here E, F, G) look the same to me:

A B C D E F G
A B C D E G F
A B C D F E G
A B C D F G E
A B C D G F E
A B C D G E F

So instead of 7! arrangements we have 7!/3!, because the 6 different orderings of the pink balls, which we previously counted separately, are now equivalent

Page 4: Discrete and continuous distributions


The same goes for the blue balls, if we can’t tell them apart, we lose a factor of 4!

Binomial coefficient:

$$C(n,k) = \frac{\text{number of ways of arranging } n \text{ different things}}{(\text{\# of ways to arrange } k \text{ things}) \times (\text{\# of ways to arrange } n-k \text{ things})} = \frac{n!}{k!\,(n-k)!}$$

Note that the binomial coefficient is symmetric – there are the same number of ways of choosing k or n-k things out of n
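The coefficient for the ball example (n = 7, k = 3 pink) can be computed in R both from factorials and with the built-in choose(). This is only an illustrative sketch; the variable names are not from the slides.

```r
n <- 7
k <- 3
# n! / (k! (n-k)!) written out with factorials
by_hand <- factorial(n) / (factorial(k) * factorial(n - k))
by_hand
choose(n, k)       # built-in binomial coefficient
choose(n, n - k)   # symmetry: choosing the 4 blue balls gives the same count
```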

Page 5: Discrete and continuous distributions

We’ve got the coefficient, what is the distribution about?

Suppose your sample of 7 is actually drawn from a very large population (so large that it is basically unaffected by the removal of a measly 7 balls)
p = probability that ball is pink
(1-p) = probability that ball is not pink (blue)

The probability that you draw a sample with 3 pink balls and 4 blue balls in a particular order e.g. (two pink followed by 3 blues, followed by a pink followed by a blue) is

prob(pink)*prob(pink)*prob(blue)*prob(blue)*prob(blue)*prob(pink)*prob(blue)

= p^3 * (1-p)^4

Page 6: Discrete and continuous distributions

We’ve got the coefficient, what is the distribution about?

But the binomial distribution just tells us what the probability is of drawing e.g. 3 pink balls, not 3 pink balls at a particular point in the draw

The probability that you draw a sample with 3 pink balls and 4 blue balls in no particular order is

= p^3 * (1-p)^4 + p^3 * (1-p)^4 + … (one term for each possible ordering)

= C(7,3) * p^3 * (1-p)^4
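One way to see the sum over orderings in R (a sketch under the assumption p = 0.5; combn() is used here only to enumerate which 3 of the 7 positions hold pink balls):

```r
p <- 0.5
# each column of combn(7, 3) is one choice of the 3 positions holding pink balls
positions <- combn(7, 3)
n_orderings <- ncol(positions)
# every ordering has the same probability p^3 * (1-p)^4, so add one term per ordering
total <- sum(rep(p^3 * (1 - p)^4, n_orderings))
total
dbinom(3, 7, p)   # R's binomial probability, for comparison
```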

Page 7: Discrete and continuous distributions

Probability distribution

A probability distribution lists all the possible outcomes and their probabilities

Outcomes are mutually exclusive e.g. drawing 0, 1, 2, 3… pink balls

Outcome probabilities sum to one e.g. when drawing 7 balls, the number of pink balls has to be one of {0,1,2,3,4,5,6,7}

Denote p(x) to mean P(X=x), that is the probability that the outcome is x

Page 8: Discrete and continuous distributions

Binomial distribution

The binomial distribution tells us the probability of drawing k pink balls out of n

It depends on
n = the number of trials (draws)
k = the number of pink balls (successes)
p = the probability of drawing a pink ball (success)

$$p(n,k) = \binom{n}{k}\, p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$

Page 9: Discrete and continuous distributions

the binomial distribution in R

dbinom(x, size, prob)

if blue and pink balls are equally likely:

> dbinom(3,7,0.5)
[1] 0.2734375

> barplot(dbinom(0:7,7,0.5),names.arg=0:7)

[Figure: bar plot of the binomial probabilities of 0 to 7 pink balls, n = 7, p = 0.5]

Page 10: Discrete and continuous distributions

what if p ≠ 0.5?

> barplot(dbinom(0:7,7,0.1),names.arg=0:7)

[Figure: bar plot of the binomial probabilities of 0 to 7 pink balls, n = 7, p = 0.1]

Page 11: Discrete and continuous distributions

What is the mean?

The mean of a binomial distribution is just n*p

In general, $\mu = E(X) = \sum_x x \, p(x)$

For n = 7, p = 0.5:

0*p(0) + 1*p(1) + 2*p(2) + 3*p(3) + 4*p(4) + 5*p(5) + 6*p(6) + 7*p(7) = 3.5

where the p(x) are the bar heights of the distribution (probabilities that sum to 1)
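The weighted sum for the mean can be reproduced in R. This is a small sketch for the n = 7, p = 0.5 case; the names x, px and mu are illustrative.

```r
x <- 0:7
px <- dbinom(x, size = 7, prob = 0.5)   # probability of each outcome
sum(px)                                  # the probabilities sum to 1
mu <- sum(x * px)                        # E(X) = sum of x * p(x)
mu                                       # compare with n*p = 7 * 0.5
```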

Page 12: Discrete and continuous distributions

What is the variance?

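The variance of a binomial distribution is just n*p*(1-p)

In general, $\sigma^2 = E[(X-\mu)^2] = \sum_x (x-\mu)^2 \, p(x)$

For n = 7, p = 0.5 (so μ = 3.5):

(-3.5)^2*p(0) + (-2.5)^2*p(1) + (-1.5)^2*p(2) + (-0.5)^2*p(3) + (0.5)^2*p(4) + (1.5)^2*p(5) + (2.5)^2*p(6) + (3.5)^2*p(7)

where the p(x) are the bar heights of the distribution (probabilities that sum to 1)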
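The same kind of check works for the variance. The sketch below again assumes n = 7 and p = 0.5, with illustrative variable names.

```r
x <- 0:7
px <- dbinom(x, size = 7, prob = 0.5)
mu <- sum(x * px)                 # mean, as before
sigma2 <- sum((x - mu)^2 * px)    # E[(X - mu)^2]
sigma2                            # compare with n*p*(1-p) = 7 * 0.5 * 0.5
```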

Page 13: Discrete and continuous distributions

Which distribution has greater variance?

[Figure: two bar plots of binomial distributions with n = 7, one for p = 0.5 and one for p = 0.1]

p = 0.5: var = n*p*(1-p) = 7*0.5*0.5 = 7*0.25 = 1.75
p = 0.1: var = n*p*(1-p) = 7*0.1*0.9 = 7*0.09 = 0.63

Page 14: Discrete and continuous distributions

briefly comparing an experiment to a distribution

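experiments = 1000
tosses = 7
y = numeric(experiments)   # number of heads in each experiment
for (i in 1:experiments) {
  x = sample(c("H","T"), tosses, replace = TRUE)   # toss a fair coin 7 times
  y[i] = sum(x=="H")                               # count the heads
}
hist(y, breaks=-0.5:7.5)                           # one bin per possible count 0..7
lines(0:7, dbinom(0:7,tosses,0.5)*experiments)     # theoretical expected counts

[Figure: "Histogram of y" (frequency vs. y), the result of 1000 trials, with the theoretical distribution overlaid as a line]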

Page 15: Discrete and continuous distributions

cumulative distribution

aka CDF = cumulative distribution function
the probability that X is less than or equal to some value a

$$F(a) = \Pr(X \le a) = \sum_{x \le a} \Pr(X = x) = \sum_{x \le a} p(x)$$
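For a discrete distribution the CDF is just a running sum of the probabilities. A short sketch in R for the n = 7, p = 0.5 binomial (the names px and Fx are illustrative):

```r
px <- dbinom(0:7, size = 7, prob = 0.5)   # P(X = x) for x = 0..7
Fx <- cumsum(px)                          # running sum gives P(X <= x)
Fx
pbinom(0:7, size = 7, prob = 0.5)         # R's built-in CDF, for comparison
```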

Page 16: Discrete and continuous distributions

cumulative distribution

> barplot(dbinom(0:7,7,0.5),names.arg=0:7)

[Figure: probability distribution, P(X=x) for x = 0 to 7]

> barplot(pbinom(0:7,7,0.5),names.arg=0:7)

[Figure: cumulative distribution, P(X≤x) for x = 0 to 7]

Page 17: Discrete and continuous distributions

cumulative distribution

[Figure: the probability distribution P(X=x) and the cumulative distribution P(X≤x) side by side, x = 0 to 7]

Page 18: Discrete and continuous distributions

example: surfers on a website

Your site has a lot of visitors, 45% of whom are female
You’ve created a new section on gardening
Out of the first 100 visitors, 55 are female.
What is the probability that this many or more of the visitors are female?
P(X≥55) = 1 - P(X≤54) = 1-pbinom(54,100,0.45)

Page 19: Discrete and continuous distributions

another way to calculate cumulative probabilities

?pbinom
P(X≤x) = pbinom(x, size, prob, lower.tail = T)
P(X>x) = pbinom(x, size, prob, lower.tail = F)

> 1-pbinom(54,100,0.45)
[1] 0.02839342

> pbinom(54,100,0.45,lower.tail=F)
[1] 0.02839342

Page 20: Discrete and continuous distributions

female surfers visiting a section of a website

[Figure: probability distribution of the number of female visitors out of 100 (x from 0 to 100)]

what is the area under the curve?

Page 21: Discrete and continuous distributions

cumulative distribution

[Figure: cumulative distribution of the number of female visitors out of 100; the part above 54 is marked as < 3%]

> 1-pbinom(54,100,0.45)
[1] 0.02839342

Page 22: Discrete and continuous distributions

Another discrete distribution: hypergeometric

randomly draw n elements without replacement from a set of N elements, r of which are S’s (successes) and (N-r) of which are F’s (failures)

hypergeometric random variable x is the number of S’s in the draw of n elements

$$p(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$$

Page 23: Discrete and continuous distributions

hypergeometric example

fortune cookies
there are N = 20 fortune cookies
r = 18 have a fortune, N-r = 2 are empty
What is the probability that out of n = 5 cookies, x = 5 have a fortune (that is, we don’t notice that some cookies are empty)?

> dhyper(5, 18, 2, 5)
[1] 0.5526316

So there is a greater than 50% chance that we won’t notice.
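The dhyper() result can be reproduced from binomial coefficients: the number of ways to pick x of the r successes, times the number of ways to pick the remaining n-x from the N-r failures, divided by the number of ways to pick any n of N. A sketch with illustrative variable names:

```r
N <- 20   # cookies in total
r <- 18   # cookies with a fortune
n <- 5    # cookies drawn
x <- 5    # drawn cookies that have a fortune
by_formula <- choose(r, x) * choose(N - r, n - x) / choose(N, n)
by_formula
dhyper(x, r, N - r, n)   # arguments: successes drawn, successes, failures, draws
```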

Page 24: Discrete and continuous distributions

hypergeometric and binomial

When the population N is (very) big, whether one samples with or without replacement is pretty much the same

100 cookies, 10 of which are empty

[Figure: side-by-side bar plot of binomial and hypergeometric probabilities for the number of full cookies out of 5 (x = 1 to 5)]

Page 25: Discrete and continuous distributions

code aside

> x = 1:5
> y1 = dhyper(1:5,90,10,5)   # hypergeometric probability
> y2 = dbinom(1:5,5,0.9)     # binomial probability
> tmp = as.matrix(t(cbind(y1,y2)))
> barplot(tmp,beside=T,names.arg=x)

Page 26: Discrete and continuous distributions

Poisson distribution

# of events in a given interval
e.g. number of light bulbs burning out in a building in a year
# of people arriving in a queue per minute

$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

λ = mean # of events in a given interval
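The Poisson probability λ^x e^(-λ) / x! can be checked against R's dpois(). The sketch below assumes λ = 5 purely for illustration.

```r
lambda <- 5   # assumed mean number of events per interval
x <- 0:10
by_formula <- lambda^x * exp(-lambda) / factorial(x)
by_formula
dpois(x, lambda)   # built-in Poisson probabilities, for comparison
```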

Page 27: Discrete and continuous distributions

Example: Poisson distribution

You got a box of 1,000 widgets.
The manufacturer says that the failure rate is 5 per box on average.
Your box contains 10 defective widgets. What are the odds of getting 10 or more?

> ppois(9,5,lower.tail=F)
[1] 0.03182806

About 3%, maybe the manufacturer is not quite honest. Or the distribution is not Poisson?

Page 28: Discrete and continuous distributions

Poisson approximation to binomial

If n is large (e.g. > 100), p is small, and n*p is moderate (e.g. < 10), the Poisson is a good approximation to the binomial with λ = n*p

[Figure: side-by-side bar plot of binomial and Poisson probabilities for x = 0 to 15]
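A comparison along these lines can be sketched in R. The values n = 1000 and p = 0.005 are assumptions chosen to satisfy the rule of thumb (they echo the widget example) and are not necessarily the ones used for the slide's plot.

```r
n <- 1000    # assumed: large number of trials
p <- 0.005   # assumed: small success probability, so n*p = 5
x <- 0:15
binom <- dbinom(x, n, p)
pois <- dpois(x, n * p)          # Poisson with lambda = n*p
max_diff <- max(abs(binom - pois))
max_diff                         # largest pointwise difference between the two
barplot(rbind(binom, pois), beside = TRUE, names.arg = x,
        legend.text = c("binomial", "Poisson"))
```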

Page 29: Discrete and continuous distributions

Continuous distributions

Normal distribution (aka “bell curve”)
fits many biological data well, e.g. height, weight
serves as an approximation to binomial, hypergeometric, Poisson because of the Central Limit Theorem (more on this later)
is important to inference problems

Page 30: Discrete and continuous distributions

sampling from a normal distribution

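x <- rnorm(1000)                        # 1000 draws from the standard normal
h <- hist(x, plot=F)                    # compute the histogram without drawing it
ylim <- range(0,h$density,dnorm(0))     # make room for the peak of the density curve
hist(x,freq=F,ylim=ylim)                # histogram on the density scale
curve(dnorm(x),add=T)                   # overlay the theoretical normal density

[Figure: "Histogram of x" (density vs. x, from -4 to 4) with the normal density curve overlaid]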

Page 31: Discrete and continuous distributions

plotting on log axes

First of all, this is what a log function looks like

> x = 1:1000
> y = log(x)
> plot(x,y)

[Figure: plot of y = log(x) for x from 1 to 1000]

y = log(x) is equivalent to x = exp(y) = e^y

Page 32: Discrete and continuous distributions

plotting the function y = e^(-x)

> x = 1:20
> y = exp(-x)
> plot(x,y)

[Figure: plot of y = exp(-x) for x from 1 to 20 on linear axes]

hard to tell what’s going on here, all the values are so close to 0

Page 33: Discrete and continuous distributions

changing the axes

> plot(x,y,log="xy")

[Figure: both x and y on a log scale]

> plot(x,y,log="y")

[Figure: just y on a log scale]

Page 34: Discrete and continuous distributions

from PS: CO2 levels over last ~ 50 years

Page 35: Discrete and continuous distributions

CO2 levels over last ~ 400,000 years