A Time Series Analysis for Predicting Basketball Statistics
Joseph DeLay
Department of Statistics and Actuarial Science, University of Iowa,
Iowa City, IA 52242
Summary of findings
In this report, we looked at a game-by-game analysis of points scored by Chicago Bulls star
Derrick Rose. Using ARIMA methods, we sought the best-fitting model for the data
in order to forecast future point-scoring potential. Our analysis shows that the best-fitting
model is an IMA(1,1) model, where the data had to be differenced once. When running the
forecast model, the point forecast flattens to a straight line at roughly Rose’s average points per game.
Though the results are somewhat inconclusive, adding
more data points, potentially over multiple years, would likely improve this model’s accuracy.
Abstract and Personal Interest
In the NBA today, there are hundreds of statistics that professional statisticians
follow. Part of their job is to create statistical models that best predict player performance
so that teams can determine which players to keep or cut from the roster, as well as which ones
to pay the large contracts. In this paper, I plan to create my own models to determine a way
to predict a player’s points in a game based on his previous point totals. I will be using data
on Derrick Rose, star point guard of the Chicago Bulls, that I received from espn.com. My
goal is to discover an ARIMA model that can be used to forecast Rose’s future point totals.
Players change all the time in the NBA, from rookies to seasoned veterans. Injuries
occur, players are traded to different teams or dropped from their team altogether, and
players are added to teams to take the place of someone already starting. For these
reasons, as well as a growing NCAA basketball market and changing NBA rules, it has
become increasingly difficult to predict players’ statistics. I will be taking data on Derrick
Rose on a game-by-game basis (not taking the games that he didn’t play into consideration)
and using my models to predict his points per game.
I have an extreme personal interest in this project because it essentially opens the
door for the work I want to do in the future. This fall I was offered an opportunity to work
with Marc Grossman, the head statistician of the Chicago Bulls, for the two Chicagoland
college basketball teams he does statistics for. Although my initial work with him will be
nothing more than petty tasks and fact and data checking, I want to use that experience and
what I have learned already to conduct independent research on basketball statistics and
one day become a professional sports statistician. This project will hopefully provide me
with resume-level experience to show potential future employers.
Background of the Scientific Questions
I expect the data to possess the characteristics of a normally distributed data set. We
define the data set by
$D = \{d_1, d_2, \ldots, d_n\} \sim N(\mu_i, \sigma_i^2)$, where $i = 1, 2, \ldots, n$
with $\{D\}$ representing the data set of Rose’s in-game statistics, $\mu_i$ representing the average
of each statistic $i$ (only including games that he actually played), and $\sigma_i^2$ representing the
square of the standard deviation of each statistic $i$. This data will be normally distributed
with mean $\mu = (\sum d_n)/n$, where $n$ is the number of games Rose played in the 2014-15 NBA season, and
expected value $E(d_n) = \mu$.
In this data set, I will be using 51 data points, corresponding to the games that Rose played for
the Chicago Bulls throughout the 2014-15 NBA season. I decided to cut out the
games in which he did not play, since those games show all statistics as zero. Including them would
not only have pulled the averages down heavily, but those games probably would also
have been flagged as outliers.
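The filtering step described above can be sketched in R as follows. The point totals are made-up placeholder values, not Rose's actual game log, and the column name PTS is only assumed to match the export.

```r
# Hypothetical game log: a zero marks a game the player sat out.
# These values are placeholders for illustration, not the ESPN data.
game_log <- data.frame(PTS = c(13, 0, 20, 24, 0, 18, 9, 31))

# Keep only games actually played, as described in the text.
played <- subset(game_log, PTS > 0)$PTS

n_games  <- length(played)   # number of games played
mean_pts <- mean(played)     # sample estimate of the mean
var_pts  <- var(played)      # sample estimate of the variance

# Including the zeros pulls the average down noticeably.
mean_all <- mean(game_log$PTS)
```

Filtering on points alone would also drop any game in which the player appeared but did not score; filtering on minutes played would be one way to avoid that.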
I obtained the data for this project from espn.com at
http://espn.go.com/nba/player/gamelog/_/id/3456/derrick-rose
Technical Analyses with Interpretations
I started by plotting the time series. It was difficult to know whether to expect a
seasonal pattern in the data because it was only recorded over an 8-month period of the
year. However, as you can see below, the series has a roughly bell-shaped profile, meaning
there was an abundance of high-point games a little before the midway point of the season.
This time of the season is usually when a team’s best players are playing the most minutes
per game, so higher point totals can be expected. Keep in mind that, though there are 82
games in the NBA season, Rose only played 51 games this year.
After taking an initial look at the time series, a histogram can be plotted to verify the
assumption that the data would be normally distributed. The bell curve was almost an
exact fit, with the tip of the curve at $\mu = 17.7$ points per game.
I then computed the Box-Cox graph. The maximum likelihood estimate of $\lambda = 1$, as well as the normality of the histogram, suggests that we do not need to transform the data.
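To illustrate how a Box-Cox estimate of this kind is obtained, here is a minimal base-R sketch of the profile log-likelihood. It assumes independent normal observations, which is a simplification (the BoxCox.ar function used in this report fits an autoregressive model instead), and it runs on simulated placeholder data rather than the actual point totals.

```r
# Profile log-likelihood of the Box-Cox parameter lambda, assuming
# independent normal observations (a simplification of BoxCox.ar).
boxcox_loglik <- function(y, lambda) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  n <- length(y)
  sigma2 <- mean((z - mean(z))^2)   # MLE of the variance of z
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(y))
}

set.seed(1)
# Simulated positive "points" data; placeholder, not the real series.
y_sim <- pmax(round(rnorm(51, mean = 17.7, sd = 6)), 1)

lambdas <- seq(-1, 2, by = 0.05)
loglik  <- sapply(lambdas, function(l) boxcox_loglik(y_sim, l))
lambda_hat <- lambdas[which.max(loglik)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- lambdas[loglik >= max(loglik) - qchisq(0.95, df = 1) / 2]
ci <- range(in_ci)
```

If the interval contains 1, the data give no strong reason to transform, which is the reading applied in the paragraph above.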
From these graphs, we can tell almost with certainty that we will need to difference the
data. The isolated significant spikes at lag 8 in both the ACF and PACF suggest that there are
trends and seasonality in the data that differencing can eliminate. When the data are
differenced once, as shown below, the ACF and PACF show clear cutoff points at lags 1 and
3, respectively, which is what we are looking for.
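The cutoff lags read from the plots can also be checked numerically. The sketch below uses base R only and a simulated placeholder series in place of the 51 point totals; the 1.96/sqrt(n) bound is the usual approximate white-noise limit drawn on ACF plots.

```r
set.seed(2)
# Placeholder series standing in for the 51 point totals.
y_sim <- rnorm(51, mean = 17.7, sd = 6)
d1 <- diff(y_sim)                       # first difference

# Sample ACF and PACF without plotting.
acf_vals  <- as.numeric(acf(d1, lag.max = 15, plot = FALSE)$acf)[-1]  # drop lag 0
pacf_vals <- as.numeric(pacf(d1, lag.max = 15, plot = FALSE)$acf)

# Approximate 95% white-noise bound used in the plots.
bound <- qnorm(0.975) / sqrt(length(d1))

# Lags whose sample correlation falls outside the bound.
sig_acf_lags  <- which(abs(acf_vals)  > bound)
sig_pacf_lags <- which(abs(pacf_vals) > bound)
```

An ACF that is significant only at lag 1 is the pattern associated with an MA(1) term in the differenced series.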
We can now look at formal tests to check for stationarity and determine if
differencing is necessary for the data. The first test we’ll look at is the Box-Ljung test, which
shows if there is significant evidence of non-zero correlations for a set number of lags. A
p-value less than 0.05 indicates that at least one autocorrelation up to the chosen lag is
non-zero; strictly speaking, this is a test for serial correlation rather than for stationarity. The
second test, which is used along with the Box-Ljung test, is the Augmented Dickey-Fuller
test, whose null hypothesis is that the series is non-stationary: if the p-value is less than 0.05, the data are
taken to be stationary and do not need to be differenced. We will input the non-differenced data first,
checking the first 20 lags for non-zero correlations.
Here, the null hypothesis is that the data are non-stationary and need differencing. Both
tests compute a p-value greater than 0.05, so we fail to reject the null hypothesis. This
suggests that we do need to difference the data; however, we do not know how many times
we will need to difference the data, so we will run another test with d=1, or the first
difference.
This run of the tests computes p-values less than 0.05 for both, meaning we can reject the
null hypothesis that the data are non-stationary and need further differencing. The data are
now stationary and d=1 in the ARIMA(p,d,q) model. Furthermore, the following EACF
provides evidence for the rest of the model, ARMA(p,q). With a vertex at (0,1), this suggests
the parameters are p=0 and q=1.
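As a rough illustration of the Box-Ljung step, the following base-R sketch runs the test on a series and on its first difference. The data are simulated placeholders, so the p-values will not match those reported here. The Augmented Dickey-Fuller test is not reproduced because adf.test comes from the tseries package rather than base R.

```r
set.seed(3)
# Placeholder data: independent draws around 17.7 points per game.
y_sim <- rnorm(51, mean = 17.7, sd = 6)

# Ljung-Box test on the raw series and on its first difference.
# H0: the autocorrelations at lags 1..20 are all zero.
lb_raw  <- Box.test(y_sim,       lag = 20, type = "Ljung-Box")
lb_diff <- Box.test(diff(y_sim), lag = 20, type = "Ljung-Box")

p_raw  <- lb_raw$p.value
p_diff <- lb_diff$p.value

# A small p-value signals serial correlation in the series tested;
# it is not by itself evidence for or against stationarity.
```

Comparing p_raw with p_diff shows how differencing changes the autocorrelation structure, which is the comparison made in the two test runs above.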
In order to verify the model, we can use the auto.arima function from the forecast package in R. This
function returns the best ARIMA model according to either the AIC, AICc, or BIC value. It
conducts a search over possible models within the order constraints provided
(Hyndman). After running the function, it returned the same model we identified, ARIMA(0,1,1), or IMA(1,1), so our
verification is complete. With the MA coefficient
$\theta_1 = -0.8304$ and the convention $Y_t - Y_{t-1} = e_t - \theta_1 e_{t-1}$, this model can be represented as
$Y_t - Y_{t-1} = e_t + 0.8304\, e_{t-1}$
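The model search and fit can be sketched in base R without the forecast package. This is a simplified stand-in for an automatic search, not the auto.arima algorithm itself, and it runs on a simulated series. Note that R's arima and arima.sim write the MA part as e_t + theta * e_{t-1}, so the equation as written above corresponds to ma = 0.8304 in this sketch.

```r
set.seed(4)
# Simulate an IMA(1,1) series matching the written equation:
# Y_t - Y_{t-1} = e_t + 0.8304 * e_{t-1} (R's sign convention).
y_sim <- arima.sim(model = list(order = c(0, 1, 1), ma = 0.8304),
                   n = 2000)

# Small grid search over ARIMA(p,1,q), ranked by AIC.
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- apply(grid, 1, function(r) {
  fit <- tryCatch(arima(y_sim, order = c(r[["p"]], 1, r[["q"]])),
                  error = function(e) NULL)
  if (is.null(fit)) NA else AIC(fit)
})
best <- grid[which.min(grid$aic), ]

# Refit the IMA(1,1) model and read off the MA coefficient.
fit011 <- arima(y_sim, order = c(0, 1, 1))
theta_hat <- unname(coef(fit011)["ma1"])
```

With a long simulated series the estimate theta_hat lands close to the value used to generate the data. With only 51 observations the estimate would be considerably less precise.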
The final step of this project is to forecast future events, in this case
the number of points Rose will score in future games. We produced time series displays for
both the differenced data and the residuals of the data, as well as the forecast. In
the residual plots, all the spikes are within the significance limits, so the residuals
appear to be white noise (OTexts). The top left graph on the next page is the time series
display for the differenced data, the top right graph represents the residuals from the
differenced data, the bottom graph represents the residuals of the forecasted data, and the
last graph (after the next page) is the plot of the forecasted data points.
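The residual check and the forecast can be sketched in base R as follows. The series is simulated from an IMA(1,1) model, so it is a placeholder for the real data, and predict is used in place of the forecast function from the forecast package. For an ARIMA(0,1,1) model without drift, the point forecast is the same at every horizon while the intervals widen.

```r
set.seed(5)
# Placeholder IMA(1,1) series (coefficient in R's sign convention).
y_sim <- arima.sim(model = list(order = c(0, 1, 1), ma = 0.8304),
                   n = 200)
fit <- arima(y_sim, order = c(0, 1, 1))

# Residual check: Ljung-Box on the residuals, with one degree of
# freedom removed for the estimated MA coefficient.
lb_res <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box",
                   fitdf = 1)

# Point forecasts and standard errors for the next 35 steps.
fc    <- predict(fit, n.ahead = 35)
point <- as.numeric(fc$pred)
se    <- as.numeric(fc$se)

# Approximate 80% and 95% prediction intervals.
lower95 <- point - qnorm(0.975) * se
upper95 <- point + qnorm(0.975) * se
lower80 <- point - qnorm(0.90) * se
upper80 <- point + qnorm(0.90) * se
```

A large p-value in lb_res is consistent with white-noise residuals, which is the property read off the residual plots in the text.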
Conclusion
The goal of this project was to find an ARIMA model that could be used to predict Derrick
Rose’s point totals for future games. The computations produced an IMA(1,1) model,
leaving the forecasted data somewhat inconclusive. The flat tail in the forecast graph
arises because, with only 51 data points to work with, the new data points
quickly converge to roughly the average (17.7 points per game). With more data, say, a couple of years’
worth, this model might produce more fluctuating data points from $Y_{t+1}$ onward until it again begins
to average out, at a slower pace. The abrupt widening of the
intervals at 80% and 95% likewise reflects the small number of data points we used:
there was not much of a gap (by the NBA’s standards) between Derrick Rose’s lowest point
total of the season (not counting games he didn’t play) and his highest of the season, so the
intervals cover nearly that whole range on any given day. Models like
this can be used for the short term, but much more data must be added to establish their
usefulness in the long term.
Data used and relevant R codes
Data: Derrick Rose game-by-game points for the 2014-2015 NBA regular season, provided
by http://espn.go.com/nba/player/gamelog/_/id/3456/derrick-rose
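# Set the working directory to the folder containing the CSV, then read it in
data=read.csv("DRose2014-15StatsonGP.csv", header=TRUE)

# Load the required packages (install them first if needed)
library(TSA)
library(forecast)
library(tseries) # provides adf.test

# Pull the individual statistics out of the data frame
minutes=data$MIN
rebounds=data$REB
assists=data$AST
blocks=data$BLK
steals=data$STL
personals=data$PF
turnovers=data$TO
points=data$PTS
GM=data$GMNUM

# Keep only games with points recorded (games not played show zeros)
points=as.matrix(points)
x=as.vector(t(points))
y=subset(x,x>0)
z=ts(y,start=1,frequency=1)

# Time series plot and histogram with density curve
plot(y,ylab="Points",main="Derrick Rose points in regular season games")
hist(y,prob=TRUE,xlab="Points",main="Derrick Rose points in regular season games")
lines(density(y))

# ACF, PACF and EACF of the raw series
par(mfrow=c(2,1))
acf(y,main="ACF of 'Points'")
pacf(y,main="PACF of 'Points'")
eacf(y)

# Box-Cox transformation check
bc=BoxCox.ar(y,method="mle")

# Stationarity tests on the raw series
Box.test(y,lag=20,type="Ljung-Box")
adf.test(y,alternative="stationary")

# ACF, PACF and EACF of the first difference
acf(diff(y),main="ACF of differenced 'Points'")
pacf(diff(y),main="PACF of differenced 'Points'")
eacf(diff(y))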
# Stationarity tests on the first difference
Box.test(diff(y),lag=20,type="Ljung-Box")
adf.test(diff(y),alternative="stationary")

# Automatic model search; ic, test and seasonal.test are set to the
# first option of each original vector, which is the one R selects
auto.arima(y, d=1, D=1, max.p=4, max.q=3, max.P=4, max.Q=3,
  max.order=20, max.d=2, max.D=2,
  start.p=0, start.q=0, start.P=0, start.Q=0,
  stationary=FALSE, seasonal=TRUE, ic="aicc",
  stepwise=FALSE, approximation=FALSE, trace=FALSE, xreg=NULL,
  test="kpss", seasonal.test="ocsb",
  allowdrift=TRUE, allowmean=TRUE, lambda=NULL,
  parallel=FALSE, num.cores=NULL)

# Fit the identified IMA(1,1) model and print it
m1=arima(y,order=c(0,1,1),include.mean=TRUE)
m1

tsdisplay(diff(y),plot.type="partial",points=TRUE,ci.type="white",lag.max=20,na.action=na.contiguous,main="Derrick Rose performance by game",xlab="Game Number",ylab="Residuals") #fit a time series, ACF and PACF plots

fit=Arima(y,order=c(0,1,1),seasonal=c(0,1,1))
tsdisplay(residuals(fit),plot.type="partial",points=TRUE,ci.type="white",lag.max=20,na.action=na.contiguous,main="Derrick Rose performance by game",xlab="Game Number",ylab="Residuals")

fit1=Arima(y,order=c(2,1,3),seasonal=c(0,1,0)) #fitting 2 extra seasonal terms in the forecast model
res=residuals(fit1)
tsdisplay(res,plot.type="partial",points=TRUE,ci.type="white",lag.max=20,na.action=na.contiguous,main="Derrick Rose performance by game",xlab="Game Number",ylab="Residuals")

# Forecast the next 35 games
plot(forecast(fit1,h=35),xlab="Game Number",ylab="Points",main="Derrick Rose predicted points in future games")
References
Banerjee, A., Dolado, J. J., Galbraith, J. W. and Hendry, D. F. (1993) Cointegration, Error Correction, and the Econometric Analysis of Non-Stationary Data. Oxford University Press, Oxford.
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society B, 26, 211–252.
Box, G. E. P. and Pierce, D. A. (1970) Distribution of residual correlations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65, 1509–1526.
Harvey, A. C. (1993) Time Series Models. 2nd Edition, Harvester Wheatsheaf, NY, pp. 44, 45.
Hyndman, R. J. and Athanasopoulos, G. (2014) Forecasting: Principles and Practice. OTexts, Melbourne, Australia. http://www.otexts.org/fpp/
Hyndman, R. J. and Khandakar, Y. (2008) Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3).
Ljung, G. M. and Box, G. E. P. (1978) On a measure of lack of fit in time series models. Biometrika, 65, 297–303.
Said, S. E. and Dickey, D. A. (1984) Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika, 71, 599–607.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.