The Glivenko-Cantelli Lemma and Kolmogorov’s Test With Applications to Basketball Data
By Adam Levin
Table of Contents:
Introduction
Proof of Glivenko-Cantelli Lemma
Further Outline of Kolmogorov-Smirnov Tests
Application #1
Application #2
Concluding Remarks
References
Introduction:
Suppose we encounter a sample of data and would like to know how likely it is
that the sample comes from a particular, fully specified distribution. Or maybe we have two
different samples and would like to know how likely it is that they were drawn from the
same distribution.
Fortunately, the mathematicians Valery Glivenko, Francesco Cantelli, and Andrey
Kolmogorov have studied these questions extensively. Glivenko and Cantelli's combined
work is now known as the Glivenko-Cantelli theorem. The theorem concerns the
empirical distribution function (or ECDF) of a random sample X1, . . . , Xn, defined as

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i),

and states that F_n converges uniformly to F(x), the underlying distribution function
of X. Stated in math form this is:

\sup_{x \in \mathbb{R}} | F_n(x) - F(x) | \to 0 almost surely.
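To make these definitions concrete, here is a small sketch of the ECDF and the sup-distance in Python (my own illustration; the paper's actual computations were done in R, and the toy sample below is invented):

```python
import bisect

def ecdf(sample):
    """Return F_n as a function, built from a sorted copy of the sample."""
    xs = sorted(sample)
    n = len(xs)
    def F_n(x):
        # number of sample points <= x, divided by n
        return bisect.bisect_right(xs, x) / n
    return F_n

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)|, attained at (or just before) a sample point."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # F_n jumps from (i-1)/n to i/n at x; check both sides of the jump
        d = max(d, abs(i / n - F(x)), abs((i - 1) / n - F(x)))
    return d

# Toy check against the Uniform(0,1) CDF, F(x) = x on [0,1]:
# the largest gap is at x = 0.2, where F_n = 0.5 and F = 0.2
sample = [0.1, 0.2, 0.6, 0.8]
print(ks_distance(sample, lambda x: x))
```

Checking both sides of each jump matters: the supremum of |F_n - F| can be approached just before a sample point, where F_n still has its old value.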
Thus we know that if our sample comes from a known distribution, the ECDF of
the sample will converge to that distribution's CDF, given n large enough. In practice,
however, we often have a limited number of data points. Suppose we would like to state
how sure we are that a given sample of data comes from a distribution we think it
resembles. This is where Kolmogorov's work comes in handy. Kolmogorov and Nikolai
Smirnov defined the statistic

D_n = \sup_x | F_n(x) - F(x) |.

Kolmogorov derived the distribution of this statistic, and Smirnov published tables of its
values. There is also a two-sample Kolmogorov-Smirnov test for assessing how likely it
is that two samples come from the same distribution.
I will discuss two applications of the Kolmogorov-Smirnov test, both on
basketball data. The first is an application of the one-sample test, and it answers the
question: "how closely does the empirical distribution of the margins of victory of
college basketball games resemble a normal distribution?" The second is an application
of the two-sample test, and it answers the question: "how different is NBA players'
shooting accuracy when taking a 'cold' shot versus a 'hot' shot?"
Proof of Glivenko-Cantelli Lemma:
I have found many proofs online, but simply copying one of them would be
dishonest. I have tried to understand them, and I will outline the structure of the proofs
to show my understanding:
• By applying the Strong Law of Large Numbers we can show that for any
fixed x ∈ R, F_n(x) → F(x) almost surely as n → ∞. However, when F_n and F are
viewed as functions of x, this pointwise convergence is not by itself sufficient to
ensure that \sup_x | F_n(x) - F(x) | \to 0.
• The proof involves partitioning the real line into finitely many subintervals
(as I understand it, at quantiles of F) and bounding the distance between F_n and F
on each subinterval by their values at the endpoints, which reduces uniform
convergence to pointwise convergence at finitely many points. (This is only a
sketch of the argument, not a proof.)
Further Outline of Kolmogorov-Smirnov Tests:
As stated in the introduction, the test statistic of the one-sample test is

D_n = \sup_x | F_n(x) - F(x) |.

This can be interpreted as the maximum difference between the two distribution
functions. Since it is an absolute difference of two probabilities, it takes values
in [0,1]. Next, a distribution must be found for D_n. Since we are taking random
samples, D_n is a random variable, so one way of obtaining its distribution is to
simulate it under the null hypothesis: take many random samples of size n from the
hypothesized distribution, F(x), and calculate the test statistic for each one. The
result is a distribution of D_n under the null hypothesis, called the null
distribution. Seeing where our calculated value of D_n falls in this distribution
tells us how likely it is that our sample came from the hypothesized distribution.
The trouble with this approach, however, is that it is a very large calculation. So,
instead, the Kolmogorov-Smirnov test uses an approximation to the
null distribution.
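The resampling recipe just described can be sketched in code. This is a Python illustration of my own (the paper's computations were done in R); the choice of Uniform(0,1) as the hypothesized F and all function names are mine:

```python
import random

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)| for a one-dimensional sample."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(abs(i / n - F(x)), abs((i - 1) / n - F(x)))
        for i, x in enumerate(xs, start=1)
    )

def null_distribution(F, draw, n, reps=200, seed=0):
    """Simulate D_n under H0 by drawing `reps` samples of size n from F."""
    rng = random.Random(seed)
    return sorted(ks_distance([draw(rng) for _ in range(n)], F)
                  for _ in range(reps))

# Null distribution of D_50 when sampling from Uniform(0,1)
dist = null_distribution(F=lambda x: x, draw=lambda r: r.random(), n=50)

# A Monte Carlo p-value for a hypothetical observed statistic d_obs:
d_obs = 0.20
p_value = sum(d >= d_obs for d in dist) / len(dist)
```

With reps in the thousands and n in the tens of thousands, as in the applications below, this brute-force approach is exactly the "very large calculation" the approximation avoids.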
In 1944, Smirnov and Kolmogorov, who studied the distribution of this random
variable independently, characterized it as follows:

P(D_n \ge \lambda) = \lambda \sum_{k=0}^{\lfloor n(1-\lambda) \rfloor} \frac{n!}{(n-k)! \, k!} \left( \lambda + \frac{k}{n} \right)^{k-1} \left( 1 - \lambda - \frac{k}{n} \right)^{n-k}

for λ in (0,1). Smirnov also obtained a limit distribution for D_n. As n → ∞,

P\left( \frac{1}{18n} (6 n D_n + 1)^2 \ge \lambda \right) = e^{-\lambda} \left[ 1 + O\left( \frac{1}{n} \right) \right].

From this it follows that the Kolmogorov-Smirnov test rejects the null hypothesis
(that the sample comes from the distribution F) at significance level α when

\exp\left( - \frac{(6 n D_n + 1)^2}{18 n} \right) \le \alpha.
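Taking the limit formula at face value, the rejection rule can be coded directly (a Python sketch of my own; the function names are mine). Plugging in the values from Application #1 below (D = 0.032, n = 15,796) gives a p-value near 10^-14, the same order of magnitude as the value the R test reports there:

```python
import math

def ks_p_approx(d, n):
    """Smirnov's large-n approximation: p ~ exp(-(6*n*d + 1)^2 / (18*n))."""
    return math.exp(-(6 * n * d + 1) ** 2 / (18 * n))

def ks_reject(d, n, alpha=0.05):
    """Reject H0 (the sample came from F) when the approximate p-value <= alpha."""
    return ks_p_approx(d, n) <= alpha

# Values from Application #1: an extremely small p-value, so H0 is rejected
p = ks_p_approx(0.032, 15796)
```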
In addition to the one-sample test, the K-S framework also allows for a two-sample test.
In this case the null hypothesis is that the two samples come from the same
distribution. The null hypothesis is rejected when

D_{n,m} = \sup_x | F_{1,n}(x) - F_{2,m}(x) | > c(\alpha) \sqrt{\frac{n+m}{nm}},

where c(α) takes values supplied in published tables (I was unable to find a
closed-form expression for it).
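The two-sample statistic is just as easy to sketch. In the Python code below (my own illustration), the critical value c(α) is filled in with the standard large-sample approximation c(α) = sqrt(-ln(α/2)/2); that formula is my addition from the general asymptotic theory, not something taken from this paper's sources:

```python
import bisect, math

def ks_two_sample(a, b):
    """D_{n,m} = sup_x |F_a(x) - F_b(x)|, checked at every pooled sample point.

    Both ECDFs are step functions that only change at sample points, so the
    supremum is attained at one of the pooled points.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    return max(
        abs(bisect.bisect_right(a, x) / n - bisect.bisect_right(b, x) / m)
        for x in a + b
    )

def reject_two_sample(a, b, alpha=0.05):
    """Reject 'same distribution' when D exceeds c(alpha) * sqrt((n+m)/(n*m)).

    c(alpha) = sqrt(-ln(alpha/2)/2) is the usual asymptotic stand-in for the
    tabulated critical values (my assumption, not from the paper).
    """
    n, m = len(a), len(b)
    c = math.sqrt(-math.log(alpha / 2) / 2)
    return ks_two_sample(a, b) > c * math.sqrt((n + m) / (n * m))
```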
Application #1:
Very often we observe some data and would like to know how likely it is that it
comes from a given distribution. The normal distribution shows up in many walks of
life, and it would not be very surprising to find that the margins of victory of
basketball games are normally distributed. Let's use Kolmogorov's test to see if this is
the case.
First, let me introduce the data. The data points are the margins of victory of all
Division I college basketball games over three seasons (2010-11 through 2012-13):
15,796 games in total, so n = 15,796. A margin of victory can be negative or positive,
but the sign is assigned at random with respect to the home and away teams, so there
is no positive or negative bias.
The mean of the sample is -0.0422, and the standard deviation of the sample is 14.902.
Here is a histogram of the data:

[Figure: histogram of spreads, density scale, x from -60 to 60]

This distribution does look very normal at first glance, but let's investigate further.
The next figure shows the same histogram with a normal curve (mean 0, standard
deviation 14.902) drawn over it.

[Figure: the same histogram of spreads with the normal density curve overlaid]
(Zero is used as the mean because there is no reason the population mean should
differ from 0, and the sample mean of -0.0422 is indeed not significantly different
from 0 with 15,795 degrees of freedom. The sample standard deviation is used to
estimate the population standard deviation.)
Again, the normal distribution seems to be a very good fit to this data, but let's
quantify how good a fit it is. Let's use the one-sample K-S test to see whether we can
reject the null hypothesis that the sample comes from a normal distribution with
µ = 0 and σ = 14.902. What follows is a plot showing the ECDF of the sample (in
blue), the ECDF of a random sample of 15,795 points from that normal distribution
(in red), and the results of the two-sided K-S test.
[Figure: ECDFs of the spreads sample (blue) and the normal reference sample (red), x from -80 to 80]
One-sample Kolmogorov-Smirnov test
data: spreads
D = 0.032, p-value = 1.732e-14
alternative hypothesis: two-sided

The result may be surprising: the test decisively rejects the null hypothesis that the
spreads come from this normal distribution. The value of the statistic D was 0.032,
meaning the greatest difference between the two distribution functions was only
0.032, so the normal curve is a very close fit in absolute terms. But with n = 15,796,
even such a small discrepancy is extremely unlikely under exact normality, and the
two-sided null hypothesis is rejected with an extremely high level of confidence.
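As a sanity check on the machinery, the one-sample statistic used in this application can be recomputed by hand for a toy sample, using a normal CDF built from the error function (a Python sketch of my own; the five-point sample is invented for illustration and is not the spreads data):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(sample, F):
    """sup_x |F_n(x) - F(x)|, checked at both sides of each ECDF jump."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(abs(i / n - F(x)), abs((i - 1) / n - F(x)))
        for i, x in enumerate(xs, start=1)
    )

# Invented toy margins of victory, tested against N(0, 14.902^2) as in the paper
toy = [-20.0, -5.0, 0.0, 3.0, 18.0]
d = ks_distance(toy, lambda x: normal_cdf(x, mu=0.0, sigma=14.902))
```

With only five points, D will be large no matter what; the point is only to show the statistic being computed against a fully specified normal CDF.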
Application #2:
As mentioned in the introduction, I will now demonstrate a use of the two-sample
K-S test. The question I am interested in answering is: "do NBA players shoot better
while 'hot'?" By 'hot' I mean that they have made their last shot; this is the precise
definition of 'hot' for my application. A 'cold' shot is a shot a player takes after
missing his last shot. The first shot of each game counts as neither a 'hot shot' nor a
'cold shot'; it just establishes the player as either hot or cold. The two samples I will
look at are one sample of hot shots and one sample of cold shots. I will include the
distance from the hoop at which each shot was taken as a factor. The two samples
consist of a variable defined as follows:

X_i = D_i I_{make} - D_i I_{miss},

where D_i is the distance from which shot i was taken, I_make is an indicator variable
that is equal to one if shot i went in and 0 if not, and likewise I_miss is an indicator
variable that is equal to one if shot i did not go in and 0 if it did.
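The transformation above is easy to state in code (a Python sketch of my own; representing a shot as a (distance, made) pair is my choice, not the paper's data format):

```python
def signed_distance(distance, made):
    """X_i = D_i * I(make) - D_i * I(miss): positive if made, negative if missed."""
    return distance if made else -distance

# A few hypothetical shots: (distance in feet, whether it went in)
shots = [(0.75, True), (23.75, False), (15.0, True)]
xs = [signed_distance(d, made) for d, made in shots]
# xs == [0.75, -23.75, 15.0]
```

Encoding make/miss in the sign lets a single ECDF carry both accuracy (the split at 0) and shot distance (how far the mass sits from 0).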
The data comprises shots from all the games that took place over three NBA
seasons (2006-07, 2007-08, 2008-09), 3,509 games in total. The 'hot' sample includes
231,301 shots, and the 'cold' sample includes 264,493 shots.
The hypothesis is that if players shoot more accurately when taking a 'hot' shot,
the distribution function of the hot-shot sample will lie below the distribution
function of the cold-shot sample for x < 0, since a greater proportion of the cold
shots are missed. For x > 0 the ordering should persist: the hot-shot ECDF stays
below the cold-shot ECDF, because P(X_hot > x) > P(X_cold > x).
Here is the plot of the two ECDFs; hot shots are the red line and cold shots are
the blue:

[Figure: ECDFs of hot (red) and cold (blue) shots, x from -100 to 100]

It appears our hypothesis is correct: the hot-shot ECDF appears to lie below the
ECDF of the cold shots. Most of the shots fall in the region -15.5 < x < 15.5, so here
is a plot of only this section:
[Figure: ECDFs of hot (red) and cold (blue) shots, zoomed to -15.5 < x < 15.5]

My interpretation of this plot goes as follows. Many of the shots are taken very
close to the basket. The ECDFs rise almost vertically just above and below 0 because
the data includes no shot distances of less than 0.75 (apart from a handful of shots
from 0.25), so the many shots taken at the rim pile up at roughly ±0.75. The red line
begins to separate from the blue line at about x = -23 (missed shots from around the
three-point line). The gap then widens and continues to do so until x = 0, where the
blue line is only slightly above the red line. This means that for shots taken at
distances beyond the immediate vicinity of the hoop, there is a strong indication that
hot shots have better success. However, because the blue line is only just above the
red line around x = 0, the cold shots in the region where |x| is around 0.75 have much
greater success than the hot shots in the same region. This at first seems to contradict
the hypothesis, until one remembers something about shots taken very close to the
hoop: I propose that this result comes from the fact that a large proportion of shots
very close to the hoop come just after an offensive rebound, and many of these
close-range cold shots are taken after a player gathers his own missed shot. These
tip-ins off players' own misses, I believe, explain the cold shots' success relative to the
hot shots in this region. Looking at both functions at x = 0, we see that a hot shot is
just barely more likely to be made, regardless of distance.
I will run the following K-S tests: a two-sided test of whether the ECDFs differ
from each other overall, and two one-sided tests of whether the hot-shot ECDF is significantly
less than the cold-shot ECDF for x < 0, and whether it is significantly greater for x > 0:

Two-sample Kolmogorov-Smirnov test
data: data2[, 1] and data2[, 2]
D = 0.0293, p-value < 2.2e-16
alternative hypothesis: two-sided

x < 0:
Two-sample Kolmogorov-Smirnov test
data: subset(data2, data2[, 1] < 0)[, 1] and subset(data2, data2[, 2] < 0)[, 2]
D = 0.0479, p-value < 2.2e-16
alternative hypothesis: two-sided

x > 0:
Two-sample Kolmogorov-Smirnov test
data: subset(data2, data2[, 1] > 0)[, 1] and subset(data2, data2[, 2] > 0)[, 2]
D = 0.0395, p-value < 2.2e-16
alternative hypothesis: two-sided
All three of these tests indicate that the null hypothesis of the two distributions being the
same should be rejected. I believe this is strong evidence for a difference in shooting
accuracy when a shooter has made his last shot versus when he has missed it.
Concluding Remarks:
I hope you have enjoyed my paper. I tried my best to follow the proof of the
Glivenko-Cantelli theorem, but I know I have fallen short of expectations if I was
expected to prove the theorem. I have also tried to get to know the inner workings of
Kolmogorov's test, though I have to admit that completely understanding the
derivation of the distribution of the test statistic is beyond my capabilities.
Nonetheless, I have enjoyed the project and come up with an application that I am
excited by. Thank you.
References:
-http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test
-http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html
-http://stackoverflow.com/questions/6839956/how-do-i-plot-multiple-ecdfs-using-ggplot
-http://home.uchicago.edu/~amshaikh/webfiles/glivenko-cantelli.pdf
-http://www.encyclopediaofmath.org/index.php/Kolmogorov–Smirnov_test