The Glivenko-Cantelli Lemma and Kolmogorov’s Test With Applications to Basketball Data
By Adam Levin
Table of Contents:
Introduction
Proof of Glivenko-Cantelli Lemma
Further Outline of Kolmogorov-Smirnov Tests
Application #1
Application #2
Concluding Remarks
References
Introduction:
Suppose we encounter a sample of data and would like to know how likely it is
that the sample comes from a particular, fully specified distribution. Or maybe we have two
different samples and would like to know how likely it is that they were drawn from the
same distribution.
Fortunately, the mathematicians Valery Glivenko, Francesco Cantelli, and Andrey
Kolmogorov have studied these questions extensively. Glivenko and Cantelli's combined
work is now known as the Glivenko-Cantelli theorem. The theorem concerns the
empirical distribution function (or ECDF) of a random sample X1, . . . , Xn, defined as

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i),

and states that F_n converges uniformly to F(x), the underlying distribution function
of X. Stated in math form this is:

\sup_{x \in \mathbb{R}} | F_n(x) - F(x) | \to 0 almost surely.
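To make these definitions concrete, here is a small sketch of the ECDF and the sup-distance in Python (my own illustration; the paper's actual computations were done in R, and the toy sample below is invented):

```python
import bisect

def ecdf(sample):
    """Return F_n as a function, built from a sorted copy of the sample."""
    xs = sorted(sample)
    n = len(xs)
    def F_n(x):
        # number of sample points <= x, divided by n
        return bisect.bisect_right(xs, x) / n
    return F_n

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)|, attained at (or just before) a sample point."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # F_n jumps from (i-1)/n to i/n at x; check both sides of the jump
        d = max(d, abs(i / n - F(x)), abs((i - 1) / n - F(x)))
    return d

# Toy check against the Uniform(0,1) CDF, F(x) = x on [0,1]:
# the largest gap is at x = 0.2, where F_n = 0.5 and F = 0.2
sample = [0.1, 0.2, 0.6, 0.8]
print(ks_distance(sample, lambda x: x))
```

Checking both sides of each jump matters: the supremum of |F_n - F| can be approached just before a sample point, where F_n still has its old value.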
Thus we know that if our sample comes from a known distribution, the ECDF of
the sample will converge to that distribution's CDF, given n large enough. In practice,
however, we often have a limited number of data points. Suppose we would like to state
how sure we are that a given sample of data comes from a distribution we think it
resembles. This is where Kolmogorov's work comes in handy. Kolmogorov and Nikolai
Smirnov defined the statistic

D_n = \sup_x | F_n(x) - F(x) |.

Kolmogorov derived the distribution of this statistic, and Smirnov published tables of its
values. There is also a two-sample Kolmogorov-Smirnov test for assessing how likely it
is that two samples come from the same distribution.
I will discuss two applications of the Kolmogorov-Smirnov test, both on
basketball data. The first is an application of the one-sample test, and it answers the
question: "how closely does the empirical distribution of the margins of victory of
college basketball games resemble a normal distribution?" The second is an application
of the two-sample test, and it answers the question: "how different is NBA players'
shooting accuracy when taking a 'cold' shot versus a 'hot' shot?"
Proof of Glivenko-Cantelli Lemma:
I have found many proofs online, but simply copying one of them would be
dishonest. I have tried to understand them, and I will outline the structure of the proofs
to show my understanding:
• By applying the Strong Law of Large Numbers we can show that for any
fixed x ∈ R, F_n(x) → F(x) almost surely as n → ∞. However, when F_n and F are
viewed as functions of x, this pointwise convergence is not by itself sufficient to
ensure that \sup_x | F_n(x) - F(x) | \to 0.
• The proof involves partitioning the real line into finitely many subintervals
(as I understand it, at quantiles of F) and bounding the distance between F_n and F
on each subinterval by their values at the endpoints, which reduces uniform
convergence to pointwise convergence at finitely many points. (This is only a
sketch of the argument, not a proof.)
Further Outline of Kolmogorov-Smirnov Tests:
As stated in the introduction, the test statistic of the one-sample test is

D_n = \sup_x | F_n(x) - F(x) |.

This can be interpreted as the maximum difference between the two distribution
functions. Since it is an absolute difference of two probabilities, it takes values
in [0,1]. Next, a distribution must be found for D_n. Since we are taking random
samples, D_n is a random variable, so one way of obtaining its distribution is to
simulate it under the null hypothesis: take many random samples of size n from the
hypothesized distribution, F(x), and calculate the test statistic for each one. The
result is a distribution of D_n under the null hypothesis, called the null
distribution. Seeing where our calculated value of D_n falls in this distribution
tells us how likely it is that our sample came from the hypothesized distribution.
The trouble with this approach, however, is that it is a very large calculation. So,
instead, the Kolmogorov-Smirnov test uses an approximation to the
null distribution.
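The resampling recipe just described can be sketched in code. This is a Python illustration of my own (the paper's computations were done in R); the choice of Uniform(0,1) as the hypothesized F and all function names are mine:

```python
import random

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)| for a one-dimensional sample."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(abs(i / n - F(x)), abs((i - 1) / n - F(x)))
        for i, x in enumerate(xs, start=1)
    )

def null_distribution(F, draw, n, reps=200, seed=0):
    """Simulate D_n under H0 by drawing `reps` samples of size n from F."""
    rng = random.Random(seed)
    return sorted(ks_distance([draw(rng) for _ in range(n)], F)
                  for _ in range(reps))

# Null distribution of D_50 when sampling from Uniform(0,1)
dist = null_distribution(F=lambda x: x, draw=lambda r: r.random(), n=50)

# A Monte Carlo p-value for a hypothetical observed statistic d_obs:
d_obs = 0.20
p_value = sum(d >= d_obs for d in dist) / len(dist)
```

With reps in the thousands and n in the tens of thousands, as in the applications below, this brute-force approach is exactly the "very large calculation" the approximation avoids.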
In 1944, Smirnov and Kolmogorov, who studied the distribution of this random
variable independently, characterized it as follows:

P(D_n \ge \lambda) = \lambda \sum_{k=0}^{\lfloor n(1-\lambda) \rfloor} \frac{n!}{(n-k)! \, k!} \left( \lambda + \frac{k}{n} \right)^{k-1} \left( 1 - \lambda - \frac{k}{n} \right)^{n-k}

for λ in (0,1). Smirnov also obtained a limit distribution for D_n. As n → ∞,

P\left( \frac{1}{18n} (6 n D_n + 1)^2 \ge \lambda \right) = e^{-\lambda} \left[ 1 + O\left( \frac{1}{n} \right) \right].

From this it follows that the Kolmogorov-Smirnov test rejects the null hypothesis
(that the sample comes from the distribution F) at significance level α when

\exp\left( - \frac{(6 n D_n + 1)^2}{18 n} \right) \le \alpha.
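Taking the limit formula at face value, the rejection rule can be coded directly (a Python sketch of my own; the function names are mine). Plugging in the values from Application #1 below (D = 0.032, n = 15,796) gives a p-value near 10^-14, the same order of magnitude as the value the R test reports there:

```python
import math

def ks_p_approx(d, n):
    """Smirnov's large-n approximation: p ~ exp(-(6*n*d + 1)^2 / (18*n))."""
    return math.exp(-(6 * n * d + 1) ** 2 / (18 * n))

def ks_reject(d, n, alpha=0.05):
    """Reject H0 (the sample came from F) when the approximate p-value <= alpha."""
    return ks_p_approx(d, n) <= alpha

# Values from Application #1: an extremely small p-value, so H0 is rejected
p = ks_p_approx(0.032, 15796)
```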
In addition to the one-sample test, the K-S framework also allows for a two-sample test.
In this case the null hypothesis is that the two samples come from the same
distribution. The null hypothesis is rejected when

D_{n,m} = \sup_x | F_{1,n}(x) - F_{2,m}(x) | > c(\alpha) \sqrt{\frac{n+m}{nm}},

where c(α) takes values supplied in published tables (I was unable to find a
closed-form expression for it).
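The two-sample statistic is just as easy to sketch. In the Python code below (my own illustration), the critical value c(α) is filled in with the standard large-sample approximation c(α) = sqrt(-ln(α/2)/2); that formula is my addition from the general asymptotic theory, not something taken from this paper's sources:

```python
import bisect, math

def ks_two_sample(a, b):
    """D_{n,m} = sup_x |F_a(x) - F_b(x)|, checked at every pooled sample point.

    Both ECDFs are step functions that only change at sample points, so the
    supremum is attained at one of the pooled points.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    return max(
        abs(bisect.bisect_right(a, x) / n - bisect.bisect_right(b, x) / m)
        for x in a + b
    )

def reject_two_sample(a, b, alpha=0.05):
    """Reject 'same distribution' when D exceeds c(alpha) * sqrt((n+m)/(n*m)).

    c(alpha) = sqrt(-ln(alpha/2)/2) is the usual asymptotic stand-in for the
    tabulated critical values (my assumption, not from the paper).
    """
    n, m = len(a), len(b)
    c = math.sqrt(-math.log(alpha / 2) / 2)
    return ks_two_sample(a, b) > c * math.sqrt((n + m) / (n * m))
```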
Application #1:
Very often we observe some data and would like to know how likely it is that it
comes from a given distribution. The normal distribution shows up in many walks of
life, and it would not be very surprising to find that the margins of victory of
basketball games are normally distributed. Let's use Kolmogorov's test to see if this is
the case.
First, let me introduce the data. The data points are the margins of victory of all
Division I college basketball games over three seasons (2010-11 through 2012-13):
15,796 games in total, so n = 15,796. A margin of victory can be negative or positive,
but the sign is assigned at random with respect to the home and away teams, so there
is no positive or negative bias.
The mean of the sample is -0.0422, and the standard deviation of the sample is 14.902.
Here is a histogram of the data:

[Figure: histogram of spreads, density scale, x from -60 to 60]

This distribution does look very normal at first glance, but let's investigate further.
The next figure shows the same histogram with a normal curve (mean 0, standard
deviation 14.902) drawn over it.

[Figure: the same histogram of spreads with the normal density curve overlaid]
(Zero is used as the mean because there is no reason the population mean should
differ from 0, and the sample mean of -0.0422 is indeed not significantly different
from 0 with 15,795 degrees of freedom. The sample standard deviation is used to
estimate the population standard deviation.)
Again, the normal distribution seems to be a very good fit to this data, but let's
quantify how good a fit it is. Let's use the one-sample K-S test to see whether we can
reject the null hypothesis that the sample comes from a normal distribution with
µ = 0 and σ = 14.902. What follows is a plot showing the ECDF of the sample (in
blue), the ECDF of a random sample of 15,795 points from that normal distribution
(in red), and the results of the two-sided K-S test.
[Figure: ECDFs of the spreads sample (blue) and the normal reference sample (red), x from -80 to 80]
One-sample Kolmogorov-Smirnov test
data: spreads
D = 0.032, p-value = 1.732e-14
alternative hypothesis: two-sided

The result may be surprising: the test decisively rejects the null hypothesis that the
spreads come from this normal distribution. The value of the statistic D was 0.032,
meaning the greatest difference between the two distribution functions was only
0.032, so the normal curve is a very close fit in absolute terms. But with n = 15,796,
even such a small discrepancy is extremely unlikely under exact normality, and the
two-sided null hypothesis is rejected with an extremely high level of confidence.
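As a sanity check on the machinery, the one-sample statistic used in this application can be recomputed by hand for a toy sample, using a normal CDF built from the error function (a Python sketch of my own; the five-point sample is invented for illustration and is not the spreads data):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)|, checked at both sides of each ECDF jump."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(abs(i / n - F(x)), abs((i - 1) / n - F(x)))
        for i, x in enumerate(xs, start=1)
    )

# Invented toy margins of victory, tested against N(0, 14.902^2) as in the paper
toy = [-20.0, -5.0, 0.0, 3.0, 18.0]
d = ks_distance(toy, lambda x: normal_cdf(x, mu=0.0, sigma=14.902))
```

With only five points, D will be large no matter what; the point is only to show the statistic being computed against a fully specified normal CDF.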
Application #2:
As mentioned in the introduction, I will now demonstrate a use of the two-sample
K-S test. The question I am interested in answering is: "do NBA players shoot better
while 'hot'?" By 'hot' I mean that they have made their last shot; this is the precise
definition of 'hot' for my application. A 'cold' shot is a shot a player takes after
missing his last shot. The first shot of each game counts as neither a 'hot shot' nor a
'cold shot'; it just establishes the player as either hot or cold. The two samples I will
look at are one sample of hot shots and one sample of cold shots. I will include the
distance from the hoop at which each shot was taken as a factor. The two samples
consist of a variable defined as follows:

X_i = D_i I_{make} - D_i I_{miss},

where D_i is the distance from which shot i was taken, I_make is an indicator variable
that is equal to one if shot i went in and 0 if not, and likewise I_miss is an indicator
variable that is equal to one if shot i did not go in and 0 if it did.
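The transformation above is easy to state in code (a Python sketch of my own; representing a shot as a (distance, made) pair is my choice, not the paper's data format):

```python
def signed_distance(distance, made):
    """X_i = D_i * I(make) - D_i * I(miss): positive if made, negative if missed."""
    return distance if made else -distance

# A few hypothetical shots: (distance in feet, whether it went in)
shots = [(0.75, True), (23.75, False), (15.0, True)]
xs = [signed_distance(d, made) for d, made in shots]
# xs == [0.75, -23.75, 15.0]
```

Encoding make/miss in the sign lets a single ECDF carry both accuracy (the split at 0) and shot distance (how far the mass sits from 0).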
The data comprises shots from all the games that took place over three NBA
seasons (2006-07, 2007-08, 2008-09), 3,509 games in total. The 'hot' sample includes
231,301 shots, and the 'cold' sample includes 264,493 shots.
The hypothesis is that if players shoot more accurately when taking a 'hot' shot,
the distribution function of the hot-shot sample will lie below the distribution
function of the cold-shot sample for x < 0, since a greater proportion of the cold
shots are missed. For x > 0 the ordering should persist: the hot-shot ECDF stays
below the cold-shot ECDF, because P(X_hot > x) > P(X_cold > x).
Here is the plot of the two ECDFs; hot shots are the red line and cold shots are
the blue:

[Figure: ECDFs of hot (red) and cold (blue) shots, x from -100 to 100]

It appears our hypothesis is correct: the hot-shot ECDF appears to lie below the
ECDF of the cold shots. Most of the shots fall in the region -15.5 < x < 15.5, so here
is a plot of only this section:
[Figure: ECDFs of hot (red) and cold (blue) shots, zoomed to -15.5 < x < 15.5]

My interpretation of this plot goes as follows. Many of the shots are taken very
close to the basket. The ECDFs rise almost vertically just above and below 0 because
the data includes no shot distances of less than 0.75 (apart from a handful of shots
from 0.25), so the many shots taken at the rim pile up at roughly ±0.75. The red line
begins to separate from the blue line at about x = -23 (missed shots from around the
three-point line). The gap then widens and continues to do so until x = 0, where the
blue line is only slightly above the red line. This means that for shots taken at
distances beyond the immediate vicinity of the hoop, there is a strong indication that
hot shots have better success. However, because the blue line is only just above the
red line around x = 0, the cold shots in the region where |x| is around 0.75 have much
greater success than the hot shots in the same region. This at first seems to contradict
the hypothesis, until one remembers something about shots taken very close to the
hoop: I propose that this result comes from the fact that a large proportion of shots
very close to the hoop come just after an offensive rebound, and many of these
close-range cold shots are taken after a player gathers his own missed shot. These
tip-ins off players' own misses, I believe, explain the cold shots' success relative to the
hot shots in this region. Looking at both functions at x = 0, we see that a hot shot is
just barely more likely to be made, regardless of distance.
I will run the following K-S tests: a two-sided test of whether the ECDFs differ
from each other overall, and two one-sided tests of whether the hot-shot ECDF is significantly
less than the cold-shot ECDF for x < 0, and whether it is significantly greater for x > 0:

Two-sample Kolmogorov-Smirnov test
data: data2[, 1] and data2[, 2]
D = 0.0293, p-value < 2.2e-16
alternative hypothesis: two-sided

x < 0:
Two-sample Kolmogorov-Smirnov test
data: subset(data2, data2[, 1] < 0)[, 1] and subset(data2, data2[, 2] < 0)[, 2]
D = 0.0479, p-value < 2.2e-16
alternative hypothesis: two-sided

x > 0:
Two-sample Kolmogorov-Smirnov test
data: subset(data2, data2[, 1] > 0)[, 1] and subset(data2, data2[, 2] > 0)[, 2]
D = 0.0395, p-value < 2.2e-16
alternative hypothesis: two-sided
All three of these tests indicate that the null hypothesis of the two distributions being the
same should be rejected. I believe this is strong evidence for a difference in shooting
accuracy when a shooter has made his last shot versus when he has missed it.
Concluding Remarks:
I hope you have enjoyed my paper. I tried my best to follow the proof of the
Glivenko-Cantelli theorem, but I know I have fallen short of expectations if I was
expected to prove the theorem. I have also tried to get to know the inner workings of
Kolmogorov's test, though I have to admit that completely understanding the
derivation of the distribution of the test statistic is beyond my capabilities.
Nonetheless, I have enjoyed the project and come up with an application that I am
excited by. Thank you.
References:
-http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test
-http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html
-http://stackoverflow.com/questions/6839956/how-do-i-plot-multiple-ecdfs-using-ggplot
-http://home.uchicago.edu/~amshaikh/webfiles/glivenko-cantelli.pdf
-http://www.encyclopediaofmath.org/index.php/Kolmogorov–Smirnov_test