Isaac Edmiston
Examining Individual Performance in the National Basketball Association
Research Question
The National Basketball Association (NBA) is the most prestigious professional
basketball organization in the world because it employs the best basketball players on the planet.
All of the men in the NBA are great basketball players, but which ones are the most effective at
helping their teams win games? Evaluating a player’s performance is not an easy task, as there
are so many variables that can have an effect on individual productivity, with many being
difficult or impossible to quantify. The NBA keeps track of a plethora of individual player
statistics, and new statistics are being brought into the equation each year. Considering that there
are seemingly countless statistics on players in the NBA, I want to know which statistics matter
the most in assessing an individual’s performance on the court. Thus, the goal of my project is to
determine which statistics should be considered when evaluating a player’s productivity in the
National Basketball Association, using player data from the last 20 complete regular seasons.
Literature Review
The debate about the best way to evaluate an individual player’s performance in the NBA
has been ongoing for many years. In 1999 David J. Berri presented an econometric model where
he attempted to measure an individual’s productivity by associating player statistics to team wins
(Berri 411). He used per game averages rather than season totals for the individual player
statistics (412). His choice inspired my decision to use per game statistics as the independent
variables in my economic model, which helps account for players appearing in different numbers
of games each season because of injuries and other factors. Berri claimed that previous uses of
the Cobb-Douglas production function to examine the relationship between team wins and player
statistics were less effective than a linear functional form, and I took note of this when
considering the functional form for my model (413-14). In his model, which used team wins as
the dependent variable and data from the 1994 to 1997
season, Berri used points per shot, free throws made, free throws attempted, offensive rebounds,
defensive rebounds, and assist to turnover ratio, along with opponent equivalents, as his
explanatory variables (415). From his list of variables and the results of his work, I decided to
incorporate free throws made, offensive rebounds, defensive rebounds, and points as independent
variables into my own model.
Another source that I consulted was a 2007 study by Justin Kubatko and other authors
where they discussed several different NBA player statistics and their advantages and
disadvantages in assessing individual performance. This source introduced me to the concept of a
player’s effective field goal percentage, which accounts for a player’s three-point field goals
made in addition to two-point field goals (Kubatko et al. 10). Because effective field goal
percentage accounts for three-point shots being worth more points than two-point shots, and
because the authors identify it as one of four main factors that they claim contribute most to
offensive player ratings, I decided to use effective field goal percentage as an independent
variable in my model (12). I also noticed that many of the models used by Kubatko and his colleagues had team
winning percentage as the dependent variable and that they got much of their data from
Basketball-Reference, which lists individuals’ effective field goal percentages (18-19).
In 2011 Rob Simmons and David J. Berri authored a study connecting a player’s pay to
productivity, and they provided some noteworthy observations about analyzing individual
performance (Simmons and Berri 381). They noted that a player’s individual performance as
well as his cooperation with his teammates factors into productivity, which influenced my
decision to bring in assists as an explanatory variable in my model (382). Before revealing their
own model, Simmons and Berri introduced me to the concept of NBA efficiency, which adds a
player’s positive statistics including steals and blocked shots and subtracts negative statistics like
turnovers (383). Using data from 1987 to 2007, they derived their own dependent variable
accounting for team wins and adjusted for a player’s performance per 48 minutes (385). Because
their dependent variable measures performance over the length of an NBA game, it helped solidify
my decision to choose win shares per 48 minutes as my dependent variable, since I am looking at
per game statistics, and it also prompted me to consider minutes played per game as another
independent variable. In their analysis, Simmons and Berri found that three-point field goals
made, offensive rebounds, and defensive rebounds carried high weight on their
dependent variable and that age was statistically significant in their model (385). The results of
their work inspired me to use three-point field goals made, player age, steals, blocked shots, and
turnovers in my economic model as additional explanatory variables.
DESCRIPTION OF VARIABLES TO BE USED IN MODEL
VARIABLE   DESCRIPTION
ws48       Win shares per 48 minutes of play
age        Age at start of February 1 of season
mp         Minutes played per game
three_p    3-point field goals made per game
efg_per    Effective field goal percentage
ft         Free throws made per game
orb        Offensive rebounds per game
drb        Defensive rebounds per game
ast        Assists per game
stl        Steals per game
blk        Blocks per game
tov        Turnovers committed per game
pts        Points scored per game
posit      Dummy variable for position played
team       Dummy variable for team played on
Economic Model
For my model, I am using individual per game data from the 1995 to the 2014 NBA
season, which I obtained from Basketball-Reference. Because Basketball-Reference does not
incorporate team wins as a part of player statistics, I decided that win shares per 48 minutes of
play would be a good choice for my dependent variable in assessing player performance. In
determining the functional form for my model, I consulted an article by Richard A. Hofler and
James E. Payne. These two men constructed a model where they used the natural log of wins as
their dependent variable and the natural log of several player statistics that I have already
mentioned as their independent variables (Hofler and Payne 295). Their work inspired me to
transform several of my explanatory variables by taking the natural log of free throws made,
offensive rebounds, defensive rebounds, assists, steals, turnovers, and points scored. They did
not take the natural log of blocks in their regression, so I chose to leave blocks as linear in my
model (295). Although they took the natural log of wins, I did not transform win shares per 48
minutes because many of the observations of this variable are negative, and taking the natural log
would exclude all of the negative values from my regression. I also kept minutes played in linear
form since Simmons and Berri’s model used a linear functional form, and I did not take the
natural log of effective field goal percentage for coefficient interpretation reasons (Simmons and
Berri 385). Three pointers made was not one of the variables that Hofler and Payne used in their
model, and I decided to leave it untransformed in my model because taking its natural log would
have dropped over 3,600 observations from my regression. For the variable of age, I chose to
incorporate a polynomial functional form by squaring the variable since Simmons and Berri did
the same in their model (385). This is logical because a player would be expected to perform
better as he gains more experience in the league and then to see his performance decline once his
body can no longer take the toll it could when he was younger. To take
into account different players being more prone to record certain statistics over others because of
the nature and demands of their position, I generated a dummy variable for position played. I
also created a dummy variable for team to account for players playing on teams with differing
levels of success. Finally, I decided to restrict the data I used in my regression to players who
played five or more minutes per game in order to exclude observations that could generate
abnormally high or abnormally low values for my dependent variable on account of a player
recording only a couple minutes of play for an entire season. I wanted to focus on the players
who held genuine roles for their teams on a regular basis, not the ones who might only see one
minute of action if their team was winning or losing by an extreme margin. The summary
statistics for all of the variables in my regression besides my two dummy variables can be found
in Figure 1 of the Appendix.
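The data preparation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and their values are made up, only free throws are logged to keep the example short, and the helper names `safe_log` and `prepare` are hypothetical rather than part of the STATA workflow used for the paper. The sketch shows why logging a statistic that is often zero removes observations from the regression.

```python
import math

# Hypothetical per-game records; field names follow the variable table,
# but the players and values are invented for illustration.
rows = [
    {"player": "A", "age": 24, "mp": 31.2, "three_p": 1.4, "ft": 3.1},
    {"player": "B", "age": 33, "mp": 3.5, "three_p": 0.0, "ft": 0.2},
    {"player": "C", "age": 28, "mp": 12.0, "three_p": 0.0, "ft": 0.0},
]

# Variables to log. The paper logs ft, orb, drb, ast, stl, tov, and pts;
# only ft is used here to keep the sketch short.
LOGGED = ["ft"]


def safe_log(x):
    """Return the natural log of x, or None where it is undefined (x <= 0)."""
    return math.log(x) if x > 0 else None


def prepare(records, min_minutes=5.0):
    """Apply the minutes filter, add age squared, and add logged variables."""
    prepared = []
    for r in records:
        if r["mp"] < min_minutes:
            continue  # drop players below five minutes per game
        out = dict(r)
        out["age_sq"] = r["age"] ** 2
        for name in LOGGED:
            out["log_" + name] = safe_log(r[name])
        prepared.append(out)
    return prepared


sample = prepare(rows)
# Observations with an undefined log cannot enter the regression.
usable = [r for r in sample if all(r["log_" + n] is not None for n in LOGGED)]
```

Player B is removed by the minutes filter, and player C survives the filter but cannot be used once free throws are logged, because the log of zero is undefined. The same mechanism is why three pointers made was left untransformed.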
Econometric Analysis
In order to address my research question and determine which statistics matter in
assessing a player’s individual productivity as measured by win shares per 48 minutes of play, I
had to conduct econometric analysis on my model using STATA. Because my data contains a
cross-sectional element, the different NBA players who played in a particular season, and a
time-series element, the past 20 complete NBA seasons, I specified my data as a panel data set.
Because panel data cannot contain duplicates of the same cross-sectional unit with the same
time-series value, I had to edit a few players in my data set by adding an A to their last name
to differentiate them from other players who shared the same name, so that STATA would know that
no player was listed more than once for any particular season. To begin answering my research
question, I ran an ordinary least squares (OLS) regression on my model, and the output from this
regression can be found in Figure 2.
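Written out, the specification described in the Economic Model section takes roughly the following form. The coefficient symbols, the subscripts i (player) and t (season), and the summation notation for the position and team dummies are notation introduced here for illustration; the variable names follow the variable table.

```latex
\begin{aligned}
\mathit{ws48}_{it} ={}& \beta_0 + \beta_1\,\mathit{age}_{it} + \beta_2\,\mathit{age}_{it}^{2}
  + \beta_3\,\mathit{mp}_{it} + \beta_4\,\mathit{three\_p}_{it} + \beta_5\,\mathit{efg\_per}_{it} \\
& + \beta_6 \ln(\mathit{ft}_{it}) + \beta_7 \ln(\mathit{orb}_{it}) + \beta_8 \ln(\mathit{drb}_{it})
  + \beta_9 \ln(\mathit{ast}_{it}) + \beta_{10} \ln(\mathit{stl}_{it}) \\
& + \beta_{11} \ln(\mathit{tov}_{it}) + \beta_{12}\,\mathit{blk}_{it} + \beta_{13} \ln(\mathit{pts}_{it})
  + \sum_{j} \gamma_j\,\mathit{posit}_{j,it} + \sum_{k} \delta_k\,\mathit{team}_{k,it} + \varepsilon_{it}
\end{aligned}

% Turning point implied by the quadratic in age (meaningful only if \beta_2 < 0):
\mathit{age}^{*} = -\frac{\beta_1}{2\beta_2}
```

The second expression follows from setting the derivative of the age terms, \beta_1 + 2\beta_2\,\mathit{age}, equal to zero. It gives the age at which predicted win shares per 48 minutes would peak if the estimated coefficient on age squared were negative.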
To check for potential problems with the specification of my model, I started by checking
for the presence of multicollinearity. Although my model did not show any signs of perfect
multicollinearity, since STATA ran my regression without dropping any of my explanatory
variables, imperfect multicollinearity could still pose a problem. The presence of imperfect
multicollinearity could result in increased variance of my model’s estimated beta coefficients and
could lead to lower t-statistics, which would make hypothesis testing less reliable. To test my
model for the presence of imperfect multicollinearity, I began by generating a correlation matrix
of the variables in my model (Figure 3). Most of the correlation coefficients between the
independent variables are below 0.8, the level that could indicate the presence of
multicollinearity, but a few coefficients in the matrix exceed this value. The most
extreme of these is the coefficient between age and age squared, which was to be
expected because of the polynomial functional form. The presence of a few high correlation
coefficients means that some of my explanatory variables are correlated with each other,
but it does not necessarily mean that multicollinearity is a problem. To further assess the
potential issue of multicollinearity, I generated a variance inflation factor (VIF) for each of
the explanatory variables in my model (Figure 4). Most of the VIFs are either below or a little
above 5, the value above which a multicollinearity problem could be indicated, and the
mean VIF is around 6. Again, the VIFs for age and age squared stood out as expected,
and these two values pulled the mean VIF upward. Although
several VIF values are above 5, most of the values are relatively low, and there does not seem to
be an indication that multicollinearity is harmful to the model. I feel safe leaving the
possible indicators of multicollinearity untreated because dropping variables to address them
could introduce omitted variable bias.
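To illustrate why age and age squared stand out in both diagnostics, the sketch below computes a correlation coefficient and a VIF in plain Python. It rests on a simplifying assumption: with only two regressors, the VIF reduces to 1 / (1 - r^2), where r is their pairwise correlation. In the full model each VIF comes from regressing one explanatory variable on all of the others, so the numbers in Figure 4 would differ. The age range of 19 to 40 and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_regressors(xs, ys):
    """VIF in the two-regressor special case: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical range of player ages; not taken from the paper's data.
ages = list(range(19, 41))
ages_sq = [a * a for a in ages]

r = pearson_r(ages, ages_sq)
vif = vif_two_regressors(ages, ages_sq)
print(f"correlation = {r:.4f}, VIF = {vif:.1f}")
```

Over a narrow, strictly positive range such as player ages, a variable and its square are almost perfectly correlated, so a large VIF on these two terms reflects the functional form rather than a flaw in the data.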
Next I checked for the presence of serial correlation, which could create the problem of
biasing t-statistics upward and making hypothesis testing less reliable. I first generated a scatter
plot of the residuals of my model against my time variable year to try to assess whether or not
serial correlation could be an issue (Figure 5). There was no obvious evidence of serial
correlation from the scatter plot, so I calculated a Durbin-Watson statistic to check for the
potential problem. The Durbin-Watson statistic that STATA generated came out to be about 1.46,
which was less than the lower bound of 1.53, meaning that I can reject the null hypothesis of
no positive serial correlation. Because
this value was not very far below the lower bound, the extent of serial correlation is probably not
severe, but it still needs to be addressed. I decided to address the issue by
running a generalized least squares (GLS) regression with Newey-West standard errors
for my model (Figure 6). Because my data set is so large, Newey-West should work
well in treating the issue of serial correlation by inflating the standard errors to offset
their being too low as a result of serial correlation.
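For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it and applies the usual decision rule for positive serial correlation. It assumes the residuals are already ordered in time. The lower bound of 1.53 and the statistic of 1.46 come from the text above, while the upper bound of 1.70 is a placeholder chosen only so the example runs.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(d, lower, upper):
    """Decision rule for the null hypothesis of no positive serial correlation."""
    if d < lower:
        return "reject"          # evidence of positive serial correlation
    if d > upper:
        return "fail to reject"  # no evidence of positive serial correlation
    return "inconclusive"        # d falls between the two bounds


# Toy residuals that alternate in sign, which pushes d well above 2.
toy_d = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Values from the text: d = 1.46 and lower bound 1.53.
# The upper bound of 1.70 is hypothetical.
paper_decision = dw_decision(1.46, lower=1.53, upper=1.70)
```

A value of d near 2 suggests no first-order serial correlation, values toward 0 suggest positive serial correlation, and values toward 4 suggest negative serial correlation.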
Lastly, I checked my model for the issue of heteroskedasticity, which could result in the
standard errors in my model being biased downward and making hypothesis testing unreliable
because of upward biased t-statistics. Since I did not know where heteroskedasticity could likely
come from, I performed a White test to search for heteroskedasticity from any source. STATA
produced a very high test statistic, which allowed me to reject the null hypothesis of
homoskedasticity and indicated that heteroskedasticity was present in my model. In order to
try to treat this problem and because of the large size of my data set, I chose to fix the standard
errors by creating robust standard errors that would inflate the standard errors by differing
amounts based on the level of heteroskedasticity. I ran an OLS regression with robust standard
errors that also stayed consistent with the treatment of the serial correlation issue (Figure 7). The
issue of heteroskedasticity does not bias beta coefficients, so I can confidently interpret these
values in my model.
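The idea behind robust standard errors can be shown with a one-regressor sketch. It compares the classical standard error of the slope with a heteroskedasticity-robust one of the White (HC0) type, which weights each squared residual by how far its observation lies from the mean of the regressor. This is a simplification under stated assumptions: the data are made up, the model has a single explanatory variable, and statistical packages typically apply a small-sample scaling that is omitted here. The slope estimate is the same in both cases, and only the standard error changes.

```python
import math


def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC0 robust SE) for y = a + b*x fit by OLS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance for all observations.
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    classical = math.sqrt(s2 / sxx)

    # HC0 robust SE: lets each observation carry its own squared residual.
    weighted = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid))
    robust = math.sqrt(weighted) / sxx
    return slope, classical, robust


# Made-up data for illustration only.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 2.0, 6.0]
slope, classical_se, robust_se = slope_standard_errors(x, y)
```

The two standard errors need not agree, and the robust one can be larger or smaller than the classical one depending on where the large residuals fall.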
After addressing the potential issue of multicollinearity and treating my regression for the
problems of serial correlation and heteroskedasticity, I consider my model adequately specified
and can use it for interpretation. To answer my research question of which statistics
matter in determining a player’s productivity in the NBA, I will run a two-sided test at the 5
percent level of significance on the independent variables of my model. I will compare the
t-statistics of the variables in my model to the critical t-value of 1.96 to see whether I reject
or fail to reject the null hypothesis given below:
H0: explanatory variable does not have an effect on win shares per 48 minutes
HA: explanatory variable does have an effect on win shares per 48 minutes
VARIABLE   t-STATISTIC   REJECT/FAIL TO REJECT
age        1.70          Fail to reject
age_sq     -0.80         Fail to reject
mp         -6.94         Reject
three_p    4.87          Reject
efg_per    43.83         Reject
log_ft     33.36         Reject
log_orb    12.72         Reject
log_drb    9.06          Reject
log_ast    18.35         Reject
log_stl    11.71         Reject
log_tov    -42.04        Reject
blk        8.40          Reject
log_pts    -4.51         Reject
Therefore, at the 5 percent significance level, I fail to reject the null hypothesis that a
player’s age does not have an effect on his win shares per 48 minutes of play, but I do reject the
null hypotheses that minutes played, three pointers made, effective field goal percentage, free
throws made, offensive rebounds, defensive rebounds, assists, steals, turnovers, blocks, and
points scored per game do not have an effect on win shares per 48 minutes of play. I left the
dummy variables for position and team out of the table above because I did not include them in
my regression to analyze how position played and team played on affect a player’s performance,
but rather to control for position and team effects on the variables that I was trying to
analyze for their effect on player performance.
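The decision rule applied in the table can be reproduced with a short Python sketch. The t-statistics and the critical value of 1.96 are copied from the text above, and the function name `decide` is hypothetical.

```python
# t-statistics as reported in the table above.
T_STATS = {
    "age": 1.70, "age_sq": -0.80, "mp": -6.94, "three_p": 4.87,
    "efg_per": 43.83, "log_ft": 33.36, "log_orb": 12.72, "log_drb": 9.06,
    "log_ast": 18.35, "log_stl": 11.71, "log_tov": -42.04, "blk": 8.40,
    "log_pts": -4.51,
}

CRITICAL_T = 1.96  # two-sided test at the 5 percent significance level


def decide(t_stat, critical=CRITICAL_T):
    """Reject the null of no effect when |t| exceeds the critical value."""
    return "Reject" if abs(t_stat) > critical else "Fail to reject"


decisions = {name: decide(t) for name, t in T_STATS.items()}
for name, outcome in decisions.items():
    print(f"{name:<8} {T_STATS[name]:>7.2f}  {outcome}")
```

Because the test is two-sided, the sign of the t-statistic does not affect the decision, so the large negative value on log_tov leads to rejection just as the large positive value on efg_per does.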
Analysis Description
After conducting my econometric analysis, I was able to finally answer my research
question and see which statistics matter in determining a player’s productivity in the NBA. Most
of the signs of the beta coefficients and the t-values came out to be as expected with the majority
of the statistics having positive effects on win shares per 48 minutes and with turnovers per game
having negative effects, but it was somewhat surprising to see that the coefficient on points per
game was negative. A one percent increase in points scored per game is associated with about a
0.01 unit decrease in win shares per 48 minutes of play. A possible reason for this negative
relationship could be because of the high correlation between points per game and minutes
played per game (see Figure 3). Minutes played per game also had a negative estimated
beta coefficient, and the magnitude of its t-statistic is higher than that of points per
game. Win shares per 48 minutes might decrease as points per game increases partly because of
the negative relationship with minutes played per game, but other factors could also contribute.
For example, players who average a high number of points per game may be more prone to
lower efficiency numbers, such as shooting percentages, because of their high volume of shots than
players who only take a small number of shots. When looking at some of the most significant
variables in the equation, it is not surprising to see turnovers per game having a negative t-
statistic with such a high magnitude or to see that effective field goal percentage has the highest
positive t-statistic in the regression. However, it is surprising that the t-statistic was so high on
free throws made per game, considering that the coefficient of points per game was negative and
that there was a predictably high correlation between free throws and points (see Figure 3). A
one percent increase in free throws made per game is associated with about a 0.05 unit increase
in win shares per 48 minutes of play. This seems to imply that free throws are the most effective
way that a player can score, and this makes logical sense considering that when a player gets
fouled it hurts the opposing team both by potentially giving away points and putting players in
foul trouble. The fact that losing teams often try to foul late in games to try to stop the clock and
keep the game from ending could be another reason why free throws made has one of the largest
effects on win shares per 48 minutes of all the variables. Although I treated the problems of
serial correlation and heteroskedasticity in my econometric analysis, my model is still subject to
shortcomings. Omitted variable bias could have arisen if I failed to
include variables in my regression that matter in determining productivity, and my decision not to
take the natural log of my dependent variable for fear of losing observations from my regression
could have created an issue with my functional form.
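The percentage interpretations above rely on the level-log form, in which the dependent variable is untransformed and the explanatory variable is logged. In that form, a p percent increase in the explanatory variable is associated with a change of beta × ln(1 + p/100) in the dependent variable, which is approximately beta/100 for a one percent increase. The sketch below illustrates the arithmetic with a made-up coefficient of 0.5. It is not taken from the regression output, and the magnitudes quoted in the text depend on the coefficients reported in Figure 2.

```python
import math


def level_log_change(beta, pct_increase):
    """Change in y for a pct_increase percent rise in x when y = a + beta*ln(x)."""
    return beta * math.log(1.0 + pct_increase / 100.0)


# Hypothetical coefficient, used only to show the arithmetic.
beta = 0.5

exact = level_log_change(beta, 1.0)  # exact change for a 1 percent increase
approx = beta / 100.0                # common rule-of-thumb approximation
print(f"exact = {exact:.6f}, approximation = {approx:.6f}")
```

The approximation works well for small percentage changes and drifts from the exact value as the change grows, so the exact form is safer when describing large differences between players.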
Despite the potential shortcomings of my model, the results of my analysis can help
NBA teams make better decisions on which players to pursue and which ones to avoid. When it
is not obvious what contributions a particular player can make to help a team
succeed, teams could look at that player’s effective field goal percentage, free throws made per
game, turnovers committed per game, and even assists per game as potential indicators of how
productive he might be in helping the team win. My model lays groundwork for
further research into the statistics that matter most in assessing individual
performance. There will never be a model that is a perfect indicator of how productive a player
is, but models like mine, which leave room for continual improvement, could potentially change
traditional notions of how players are evaluated in the NBA.
Sources Referenced
Basketball-Reference. Sports Reference LLC. 2016. Web. 24 Mar. 2016.
Berri, David J. “Who Is ‘Most Valuable’? Measuring the Player’s Production of Wins in the
National Basketball Association.” Managerial and Decision Economics 20.8 (1999):
411-427. Web. 12 Apr. 2016.
Hofler, Richard A., and James E. Payne. “Measuring Efficiency in the National Basketball
Association.” Economics Letters 55.2 (1997): 293-299. Web. 22 Mar. 2016.
Kubatko, Justin, et al. “A Starting Point for Analyzing Basketball Statistics.” Journal of
Quantitative Analysis in Sports 3.3 (2007): 1-22. Web. 22 Mar. 2016.
Simmons, Rob, and David J. Berri. “Mixing the Princes and the Paupers: Pay and Performance
in the National Basketball Association.” Labour Economics 18.3 (2011): 381-388. Web.
21 Mar. 2016.
Appendix
Figure 1: Summary statistics for the variables in OLS regression
Figure 2: OLS regression output with dummy variables for position and team
Figure 3: Correlation matrix for variables in OLS regression
Figure 4: Variance Inflation Factor (VIF) output for variables in OLS regression
Figure 5: Scatter plot of residual against year (NBA season)
Figure 6: GLS regression output with Newey-West standard errors
Figure 7: OLS regression output with robust standard errors