A Time Series Analysis for Predicting Basketball Statistics


Joseph DeLay

Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242

Summary of findings

In this report, we analyze the game-by-game points scored by Chicago Bulls star Derrick Rose. Using ARIMA methods, we set out to find the best-fitting model for the data in order to forecast his future scoring. Our analysis shows that the best-fitting model is an IMA(1,1) model, meaning the data had to be differenced once. When the forecast is run, the predicted mean flattens to a straight line near Rose’s average points per game. Though the results are somewhat inconclusive, adding more data points, potentially over multiple seasons, would likely improve the model’s accuracy.

Abstract and Personal Interest

In the NBA today, there are hundreds of statistics that professional statisticians follow. Part of their job is to create statistical models that best predict player performance, so that teams can determine which players to keep or cut from the roster, as well as which ones to pay the large contracts. In this paper, I plan to create my own models to predict a player’s points in a game based on his previous point totals. I will be using data on Derrick Rose, star point guard of the Chicago Bulls, that I obtained from espn.com. My goal is to find an ARIMA model that can be used to forecast Rose’s future point totals.

Players change all the time in the NBA, from rookies to seasoned veterans. Injuries occur, players are traded to different teams or dropped from their team altogether, and players are added to teams to take the place of someone already starting. For these reasons, as well as a growing NCAA basketball market and changing NBA rules, it has become increasingly difficult to predict players’ statistics. I will be taking data on Derrick Rose on a game-by-game basis (not taking the games that he didn’t play into consideration) and using my models to predict his points per game.

I have a strong personal interest in this project because it essentially opens the door for the work I want to do in the future. This fall I was offered an opportunity to work with Marc Grossman, the head statistician of the Chicago Bulls, for the two Chicagoland college basketball teams he does statistics for. Although my initial work with him will be nothing more than minor tasks and fact and data checking, I want to use that experience and what I have learned already to conduct independent research on basketball statistics and one day become a professional sports statistician. This project will hopefully provide me with resume-level experience to show potential future employers.

Background of the Scientific Questions

The data I’m using are expected to have the characteristics of a normally distributed data set. We define the data set by

$$D = d_1, d_2, \ldots, d_n \sim N(\mu_i, \sigma_i^2), \quad \text{where } i = 1, 2, \ldots, n$$

with {D} representing the data set of Rose’s in-game statistics, $\mu_i$ representing the average of each statistic i (only including games that he actually played), and $\sigma_i^2$ representing the square of the standard deviation of each statistic i. These data are taken to be normally distributed with mean $\mu = \left(\sum d_n\right)/n$, where n is the number of games Rose played in the 2014-15 NBA season, and expected value $E(d_n) = \mu$.

In this data set, I use 51 data points, one for each game that Rose played for the Chicago Bulls during the 2014-15 NBA season. I excluded the games in which he did not play, since those games record every statistic as zero; including them would have pulled the distribution sharply toward zero, and those games would probably have been flagged as outliers.
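The effect of excluding games not played can be sketched in a few lines of base R. The point totals below are made up for illustration (they are not Rose’s data), and the variable names are hypothetical.

```r
# Hypothetical game log of points; 0 marks a game the player missed.
# These values are made up for illustration and are not Rose's data.
pts_all <- c(21, 0, 13, 18, 0, 24, 9, 0, 30, 17)

# Keep only games played, mirroring a "points > 0" filter.
played <- pts_all[pts_all > 0]

n      <- length(played)      # number of games played
mu_hat <- sum(played) / n     # sample mean
s2_hat <- var(played)         # sample variance (n - 1 denominator)

cat("games played:", n, "\n")
cat("mean with zeros kept:   ", round(mean(pts_all), 2), "\n")
cat("mean with zeros dropped:", round(mu_hat, 2), "\n")
cat("variance (games played):", round(s2_hat, 2), "\n")
```

One caveat of filtering on points greater than zero is that it would also drop a game in which the player appeared but did not score; filtering on minutes played instead would avoid that.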

The data for this project come from espn.com at

http://espn.go.com/nba/player/gamelog/_/id/3456/derrick-rose

Technical Analyses with Interpretations



I started by plotting the time series. It was difficult to know whether to expect a seasonal pattern in the data, because the data were recorded over only an 8-month period of the year. However, as you can see below, the series has a roughly bell-shaped profile, meaning there was an abundance of high-point games a little before the midway point of the season. This time of the season is usually when a team’s best players are playing the most minutes per game, so higher point totals can be expected. Keep in mind that, though there are 82 games in the NBA regular season, Rose played only 51 games this year.


After taking an initial look at the time series, a histogram can be plotted to verify the

assumption that the data would be normally distributed. The bell curve was almost an

exact fit, with the tip of the curve at $\mu = 17.7$ points per game.
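A histogram check of this kind can be sketched in base R as follows. The series is simulated as a stand-in for the 51 point totals (the mean and standard deviation used to simulate it are assumptions, not estimates from the real data), and the quantile comparison is an added numerical companion to the plot.

```r
set.seed(42)
# Simulated stand-in for the 51 point totals (not the real data).
y <- round(rnorm(51, mean = 18, sd = 6))

m <- mean(y)
s <- sd(y)

# Histogram on the density scale so the normal curve is comparable.
h <- hist(y, plot = FALSE)
plot(h, freq = FALSE, xlab = "Points", main = "Simulated points per game")
curve(dnorm(x, mean = m, sd = s), add = TRUE)

# Numerical companion to the plot: sample vs. fitted normal quantiles.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
print(round(rbind(sample = quantile(y, probs),
                  normal = qnorm(probs, mean = m, sd = s)), 2))
```

If the two rows of quantiles are close, the bell curve is a reasonable description of the data; large gaps in the tails would suggest otherwise.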

I then computed the Box-Cox plot. The maximum likelihood estimate, $\lambda = 1$, together with the normality of the histogram, suggests that we do not need to transform the data.
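To illustrate what the Box-Cox estimate measures, here is a minimal base R sketch of the profile log-likelihood over a grid of $\lambda$ values. It is a simplification under stated assumptions: it treats the observations as independent normal draws, whereas the BoxCox.ar function used in this report accounts for serial correlation, and it runs on simulated positive data rather than the real series. The helper names box_cox and profile_loglik are hypothetical.

```r
set.seed(7)
# Simulated positive series standing in for point totals (not the real data).
y <- abs(rnorm(51, mean = 18, sd = 6)) + 1

# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) at lambda = 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood for i.i.d. normal data (up to an additive constant).
profile_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  n <- length(y)
  sigma2 <- mean((z - mean(z))^2)   # maximum likelihood variance estimate
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.1)
ll <- sapply(lambdas, profile_loglik, y = y)
lambda_hat <- lambdas[which.max(ll)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
keep <- lambdas[ll >= max(ll) - qchisq(0.95, df = 1) / 2]
cat("lambda_hat:", lambda_hat, " approx. 95% interval:", range(keep), "\n")
```

When the interval contains 1, as the report finds for the real data, leaving the series untransformed is a defensible choice.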


From the ACF and PACF plots of the original series, it appears very likely that we will need to difference the data. The isolated significant spike at lag 8 in both the ACF and PACF suggests trend or seasonal structure in the data that differencing can remove. When the data are differenced once, as shown below, the ACF and PACF show clear cutoff points at lags 1 and 3, respectively, which is what we are looking for.
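Reading cutoff points off ACF and PACF plots amounts to checking which lags fall outside the approximate 95% white-noise limits, $\pm 1.96/\sqrt{n}$. The sketch below does this numerically in base R on a simulated IMA(1,1) series; the simulated coefficient of -0.8 and the series length are assumptions chosen for illustration.

```r
set.seed(1)
# Simulated IMA(1,1) series (illustrative only, not the real data).
y  <- arima.sim(list(order = c(0, 1, 1), ma = -0.8), n = 200)
dy <- diff(y)

# Approximate 95% white-noise limits drawn on ACF/PACF plots.
bound <- 1.96 / sqrt(length(dy))

# Sample ACF (dropping lag 0) and PACF of the differenced series.
r <- as.vector(acf(dy, lag.max = 20, plot = FALSE)$acf)[-1]
p <- as.vector(pacf(dy, lag.max = 20, plot = FALSE)$acf)

cat("limit: +/-", round(bound, 3), "\n")
cat("ACF lags outside the limits: ", which(abs(r) > bound), "\n")
cat("PACF lags outside the limits:", which(abs(p) > bound), "\n")
```

With 20 lags examined, roughly one lag can be expected to fall outside the limits by chance alone, so an isolated spike at a higher lag is weaker evidence than a clear cutoff at the first few lags.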


We can now use formal tests to check for stationarity and determine whether differencing is necessary. The first is the Box-Ljung test, which checks for significant evidence of non-zero autocorrelations over a set number of lags. Its null hypothesis is that those autocorrelations are all zero, so strictly it tests for serial correlation rather than stationarity; we use it here as a companion diagnostic. The second, used along with the Box-Ljung test, is the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary: if its p-value is less than 0.05, we conclude that the data are stationary and do not need to be differenced. We input the non-differenced data first, checking the first 20 lags for non-zero correlations.


Here, the null hypothesis of the ADF test is that the data are non-stationary and need differencing. Both tests return a p-value greater than 0.05, so we fail to reject the null hypothesis. This suggests that we do need to difference the data; however, we do not know how many times we will need to difference, so we run the tests again with d=1, the first difference.

This run of the tests returns p-values less than 0.05 for both, meaning we can reject the null hypothesis that the data are non-stationary, so no further differencing is needed. The differenced data are now stationary, and d=1 in the ARIMA(p,d,q) model. Furthermore, the following EACF provides evidence for the rest of the model, ARMA(p,q). With a vertex at (0,1), it suggests the parameters p=0 and q=1.
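The two tests can be run side by side on a series and on its first difference, as sketched below. The series is a simulated random walk (an assumption for illustration, not the real data). Box.test is in base R; adf.test comes from the tseries package, so that part runs only if tseries is installed.

```r
set.seed(2)
# Simulated random walk: non-stationary in levels, stationary after one difference.
y <- cumsum(rnorm(200))

# Ljung-Box test: H0 = autocorrelations up to lag 20 are all zero.
lb_level <- Box.test(y, lag = 20, type = "Ljung-Box")
lb_diff  <- Box.test(diff(y), lag = 20, type = "Ljung-Box")
cat("Ljung-Box p-value, levels:     ", format.pval(lb_level$p.value), "\n")
cat("Ljung-Box p-value, differences:", format.pval(lb_diff$p.value), "\n")

# Augmented Dickey-Fuller test: H0 = unit root (non-stationary).
# Requires the tseries package; skipped if it is not installed.
if (requireNamespace("tseries", quietly = TRUE)) {
  adf_level <- suppressWarnings(tseries::adf.test(y, alternative = "stationary"))
  adf_diff  <- suppressWarnings(tseries::adf.test(diff(y), alternative = "stationary"))
  cat("ADF p-value, levels:     ", format.pval(adf_level$p.value), "\n")
  cat("ADF p-value, differences:", format.pval(adf_diff$p.value), "\n")
}
```

Because the two tests have different null hypotheses, their p-values answer different questions: the ADF result speaks to whether differencing is needed, while the Ljung-Box result speaks to whether autocorrelation remains to be modeled.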


In order to verify the model, we can use the automatic ARIMA selection command in R, auto.arima. This command returns the best ARIMA model according to the AIC, AICc, or BIC value. The function conducts a search over possible models within the order constraints provided (Hyndman). The function returned the same model we identified, ARIMA(0,1,1), or IMA(1,1), which completes our verification. With the estimated MA coefficient $\theta_1 = -0.8304$, and assuming this value follows R’s sign convention, in which the MA(1) part is written $e_t + \theta_1 e_{t-1}$, the model can be represented as

$$Y_t - Y_{t-1} = e_t - 0.8304\, e_{t-1}$$
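The idea behind an automatic search can be sketched with a small hand-written grid in base R: fit each candidate ARIMA(p,1,q) and rank the fits by AIC. This is a simplified stand-in, not the algorithm that auto.arima implements, and it runs on a simulated IMA(1,1) series with an assumed coefficient of -0.8. The last lines print the fitted model in R’s sign convention.

```r
set.seed(3)
# Simulated IMA(1,1) series with MA coefficient -0.8 (illustrative only).
y <- arima.sim(list(order = c(0, 1, 1), ma = -0.8), n = 200)

# Small grid search over ARIMA(p,1,q), ranked by AIC.
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit_i <- tryCatch(
    arima(y, order = c(grid$p[i], 1, grid$q[i]), method = "ML"),
    error = function(e) NULL   # skip candidates that fail to fit
  )
  if (!is.null(fit_i)) grid$aic[i] <- AIC(fit_i)
}
print(head(grid[order(grid$aic), ], 3))

# Fit the IMA(1,1) model directly and read off the coefficient.
fit <- arima(y, order = c(0, 1, 1), method = "ML")
theta1 <- unname(coef(fit)["ma1"])

# R writes the MA part as e_t + theta1 * e_(t-1), so the sign prints as estimated.
cat(sprintf("Y_t - Y_(t-1) = e_t %+.4f e_(t-1)\n", theta1))
```

Candidates whose AIC values differ by only a small amount are hard to distinguish, so agreement between the grid, the EACF, and the ACF/PACF cutoffs is more persuasive than any one of them alone.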

The final step of this project is to forecast future values, in this case the number of points Rose will score in future games. We produce time series displays for the differenced data and for the model residuals, as well as a plot of the forecast. In the residual plots, all the spikes are now within the significance limits, so the residuals appear to be white noise (OTexts). The top left graph on the next page is the time series display for the differenced data, the top right graph shows the residuals from the differenced data, the bottom graph shows the residuals of the forecast model, and the last graph (after the next page) is the plot of the forecasted data points.
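The residual check and the forecast intervals can be reproduced numerically with base R, as in the sketch below. It uses predict on a stats::arima fit rather than the forecast package used in this report, and it runs on a simulated IMA(1,1) series (an assumption for illustration). The 80% and 95% limits are computed from the forecast standard errors under a normal approximation.

```r
set.seed(4)
# Simulated IMA(1,1) series (illustrative only, not the real data).
y <- arima.sim(list(order = c(0, 1, 1), ma = -0.8), n = 200)
fit <- arima(y, order = c(0, 1, 1), method = "ML")

# Residual check: Ljung-Box on the residuals, with one degree of freedom
# removed for the single fitted MA coefficient.
lb <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 1)
cat("Ljung-Box p-value on residuals:", format.pval(lb$p.value), "\n")

# Forecast h steps ahead with standard errors.
h  <- 10
fc <- predict(fit, n.ahead = h)
point <- as.numeric(fc$pred)
se    <- as.numeric(fc$se)

# 80% and 95% prediction limits under a normal approximation.
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)
out <- data.frame(step = 1:h, forecast = point,
                  lo80 = point - z80 * se, hi80 = point + z80 * se,
                  lo95 = point - z95 * se, hi95 = point + z95 * se)
print(round(out, 2))
```

In this sketch the point forecast is the same at every step while the limits widen with the horizon, which is the pattern of a flat forecast line inside a growing band.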


Conclusion

The goal of this project was to find an ARIMA model that could be used to predict Derrick Rose’s point totals for future games. The computations produced an IMA(1,1) model, leaving the forecasts somewhat inconclusive. The flat tail in the forecast graph arises because, with only 51 data points to work with, the forecasted values quickly converge to the average (17.7 points per game). With more data, say a couple of years’ worth, this model could produce more variable forecasts from $Y_{t+1}$ onward before they again begin to level off, at a slower pace. The discontinuous increase in the 80% and 95% prediction intervals again reflects the small number of data points: there was not much of a gap (by the NBA’s standards) between Derrick Rose’s lowest point total of the season (not counting games he didn’t play) and his highest, so the intervals imply a probability close to 100% that he will score between those numbers on any given day. Models like this can be used for the short term, but much more data would be needed before relying on them in the long term.
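The shape of the forecast can also be read from the model itself. The short derivation below assumes the IMA(1,1) model is written in R’s sign convention, $Y_t - Y_{t-1} = e_t + \theta_1 e_{t-1}$, with white-noise variance $\sigma_e^2$ and forecast horizon $\ell$ (both symbols are introduced here for illustration).

```latex
\begin{align*}
\hat{Y}_t(1) &= E\left[Y_{t+1} \mid Y_t, Y_{t-1}, \ldots\right] = Y_t + \theta_1 e_t, \\
\hat{Y}_t(\ell) &= \hat{Y}_t(\ell - 1) = \hat{Y}_t(1), \qquad \ell \ge 2, \\
\operatorname{Var}\left(Y_{t+\ell} - \hat{Y}_t(\ell)\right) &= \sigma_e^2 \left[\, 1 + (\ell - 1)(1 + \theta_1)^2 \,\right].
\end{align*}
```

Under these assumptions the point forecast is the same at every horizon, while the interval widens with $\ell$. With $\theta_1 = -0.8304$, the factor $(1 + \theta_1)^2$ is about 0.029, so the widening is slow.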


Data used and relevant R codes

Data: Derrick Rose game-by-game points for the 2014-2015 NBA regular season, provided

by http://espn.go.com/nba/player/gamelog/_/id/3456/derrick-rose

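# Set the working directory to the folder containing the CSV, then read the data
data=read.csv("DRose2014-15StatsonGP.csv", header=TRUE)

# Load the packages (install TSA, forecast, and tseries first if needed)
library(TSA)       # eacf(), BoxCox.ar()
library(forecast)  # auto.arima(), Arima(), tsdisplay(), forecast()
library(tseries)   # adf.test()

# Pull each statistic out of the data frame
minutes=data$MIN
rebounds=data$REB
assists=data$AST
blocks=data$BLK
steals=data$STL
personals=data$PF
turnovers=data$TO
points=data$PTS
GM=data$GMNUM

# Keep only games with points > 0 (games not played are recorded as 0)
points=as.matrix(points)
x=as.vector(t(points))
y=subset(x,x>0)
z=ts(y,start=1,frequency=1)

# Time series plot and histogram with a density overlay
plot(y,ylab="Points",main="Derrick Rose points in regular season games")
hist(y,prob=TRUE,xlab="Points",main="Derrick Rose points in regular season games")
lines(density(y))

# ACF, PACF, and EACF of the original series
par(mfrow=c(2,1))
acf(y,main="ACF of 'Points'")
pacf(y,main="PACF of 'Points'")
eacf(y)

# Box-Cox transformation check
bc=BoxCox.ar(y,method="mle")

# Ljung-Box and Augmented Dickey-Fuller tests on the original series
Box.test(y,lag=20,type="Ljung-Box")
adf.test(y,alternative="stationary")

# ACF, PACF, and EACF of the first-differenced series
acf(diff(y),main="ACF of differenced 'Points'")
pacf(diff(y),main="PACF of differenced 'Points'")
eacf(diff(y))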



# Ljung-Box and Augmented Dickey-Fuller tests on the first-differenced series
Box.test(diff(y),lag=20,type="Ljung-Box")
adf.test(diff(y),alternative="stationary")

# Automatic model search
auto.arima(y, d=1, D=1, max.p=4, max.q=3, max.P=4, max.Q=3, max.order=20,
           max.d=2, max.D=2, start.p=0, start.q=0, start.P=0, start.Q=0,
           stationary=FALSE, seasonal=TRUE, ic=c("aicc", "aic", "bic"),
           stepwise=FALSE, approximation=FALSE, trace=FALSE, xreg=NULL,
           test=c("kpss","adf","pp"), seasonal.test=c("ocsb","ch"),
           allowdrift=TRUE, allowmean=TRUE, lambda=NULL, parallel=FALSE,
           num.cores=NULL)

# Fit the IMA(1,1) model and print the estimates
m1=arima(y,order=c(0,1,1),include.mean=TRUE)
m1

# Time series, ACF, and PACF plots of the differenced data
tsdisplay(diff(y),plot.type=c("partial"),points=TRUE,ci.type="white",lag.max=20,
          na.action=na.contiguous,main="Derrick Rose performance by game",
          xlab="Game Number",ylab="Residuals")

# Residuals of the first fitted model
fit=Arima(y,order=c(0,1,1),seasonal=c(0,1,1))
tsdisplay(residuals(fit),plot.type=c("partial"),points=TRUE,ci.type="white",lag.max=20,
          na.action=na.contiguous,main="Derrick Rose performance by game",
          xlab="Game Number",ylab="Residuals")

# Forecast model: fitting 2 extra seasonal terms in the forecast model
fit1=Arima(y,order=c(2,1,3),seasonal=c(0,1,0))
res=residuals(fit1)
tsdisplay(res,plot.type=c("partial"),points=TRUE,ci.type="white",lag.max=20,
          na.action=na.contiguous,main="Derrick Rose performance by game",
          xlab="Game Number",ylab="Residuals")

# Plot the forecast for the next 35 games
plot(forecast(fit1,h=35),xlab="Game Number",ylab="Points",
     main="Derrick Rose predicted points in future games")


References

Banerjee, A., Dolado, J. J., Galbraith, J. W., and Hendry, D. F. (1993) Cointegration, Error Correction, and the Econometric Analysis of Non-Stationary Data. Oxford University Press, Oxford.

Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society B, 26, 211–252.

Box, G. E. P. and Pierce, D. A. (1970), Distribution of residual correlations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65, 1509–1526.

Harvey, A. C. (1993) Time Series Models. 2nd Edition, Harvester Wheatsheaf, NY, pp. 44, 45.

Hyndman, R. J. and Athanasopoulos, G. (2014) Forecasting: principles and practice. OTexts: Melbourne, Australia. http://www.otexts.org/fpp/

Hyndman, R.J. and Khandakar, Y. (2008) "Automatic time series forecasting: The forecast package for R", Journal of Statistical Software, 26(3).

Ljung, G. M. and Box, G. E. P. (1978), On a measure of lack of fit in time series models. Biometrika 65, 297–303.

Said, S. E. and Dickey, D. A. (1984) Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order. Biometrika 71, 599–607.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
