Isaac Edmiston Project

Isaac Edmiston

Examining Individual Performance in the National Basketball Association

Research Question

The National Basketball Association (NBA) is the most prestigious professional

basketball organization in the world because it employs the best basketball players on the planet.

All of the men in the NBA are great basketball players, but which ones are the most effective at

helping their teams win games? Evaluating a player’s performance is not an easy task, as there

are so many variables that can have an effect on individual productivity, with many being

difficult or impossible to quantify. The NBA keeps track of a plethora of individual player

statistics, and new statistics are being brought into the equation each year. Considering that there

are seemingly countless statistics on players in the NBA, I want to know which statistics matter

the most in assessing an individual’s performance on the court. Thus, the goal of my project is to

determine which statistics should be considered when evaluating a player’s productivity in the

National Basketball Association, using player data from the last 20 complete regular seasons.

Literature Review

The debate about the best way to evaluate an individual player’s performance in the NBA

has been ongoing for many years. In 1999 David J. Berri presented an econometric model where

he attempted to measure an individual’s productivity by associating player statistics to team wins

(Berri 411). For the individual player statistics, he used per game averages rather

than season totals (412). Berri’s use of per game statistics rather than season totals inspired my

decision to use per game statistics as my independent variables in my economic model to try to

account for players playing in different numbers of games each season due to injuries and other

factors. Berri claimed that previous uses of the Cobb-Douglas production function to examine

the relationship between team wins and player statistics were not as effective as using a linear

functional form, and I took note of this when considering the functional form for my model

(413-14). In his model using team wins as the dependent variable on data from the 1994 to 1997

season, Berri used points per shot, free throws made, free throws attempted, offensive rebounds,

defensive rebounds, and assist to turnover ratio, along with opponent equivalents, as his

explanatory variables (415). From his list of variables and the results of his work, I decided to

incorporate free throws made, offensive rebounds, defensive rebounds, and points as independent

variables into my own model.

Another source that I consulted was a 2007 study by Justin Kubatko and other authors

where they discussed several different NBA player statistics and their advantages and

disadvantages in assessing individual performance. This source introduced me to the concept of a

player’s effective field goal percentage, which accounts for a player’s three-point field goals

made in addition to two-point field goals (Kubatko et al. 10). Because effective field goal percentage

accounts for three-point shots being worth more points than two-point shots, and because the authors

state that it is one of four main factors contributing most to offensive player

ratings, I decided to use effective field goal percentage as an independent variable in my model

(12). I also noticed that many of the models used by Kubatko and his colleagues had team

winning percentage as the dependent variable and that they got much of their data from

Basketball-Reference, which lists individuals’ effective field goal percentages (18-19).

In 2011 Rob Simmons and David J. Berri authored a study connecting a player’s pay to

productivity, and they provided some noteworthy observations about analyzing individual

performance (Simmons and Berri 381). They noted that a player’s individual performance as

well as his cooperation with his teammates factors into productivity, which influenced my

decision to bring in assists as an explanatory variable in my model (382). Before revealing their

own model, Simmons and Berri introduced me to the concept of NBA efficiency, which adds a

player’s positive statistics including steals and blocked shots and subtracts negative statistics like

turnovers (383). Using data from 1987 to 2007, they derived their own dependent variable

accounting for team wins and adjusted for a player’s performance per 48 minutes (385). Their

dependent variable, which measures performance over the length of an NBA game, helped solidify

my decision to choose win shares per 48 minutes as my dependent variable since I am looking at

per game statistics and also prompted me to consider minutes played per game as another

independent variable. In their analysis using their model, Simmons and Berri found that three-

point field goals made, offensive rebounds, and defensive rebounds carried high weight on their

dependent variable and that age was statistically significant in their model (385). The results of

their work inspired me to use three-point field goals made, player age, steals, blocked shots, and

turnovers in my economic model as additional explanatory variables.

DESCRIPTION OF VARIABLES TO BE USED IN MODEL

ws48       Win shares per 48 minutes of play
age        Age at start of February 1 of season
mp         Minutes played per game
three_p    3-point field goals made per game
efg_per    Effective field goal percentage
ft         Free throws made per game
orb        Offensive rebounds per game
drb        Defensive rebounds per game
ast        Assists per game
stl        Steals per game
blk        Blocks per game
tov        Turnovers committed per game
pts        Points scored per game
posit      Dummy variable for position played
team       Dummy variable for team played on

Economic Model

For my model, I am using individual per game data from the 1995 to the 2014 NBA

season, which I obtained from Basketball-Reference. Because Basketball-Reference does not

incorporate team wins as a part of player statistics, I decided that win shares per 48 minutes of

play would be a good choice for my dependent variable in assessing player performance. In

determining the functional form for my model, I consulted an article by Richard A. Hofler and

James E. Payne. These two men constructed a model where they used the natural log of wins as

their dependent variable and the natural log of several player statistics that I have already

mentioned as their independent variables (Hofler and Payne 295). Their work inspired me to

transform several of my explanatory variables by taking the natural log of free throws made,

offensive rebounds, defensive rebounds, assists, steals, turnovers, and points scored. They did

not take the natural log of blocks in their regression, so I chose to leave blocks as linear in my

model (295). Although they took the natural log of wins, I did not transform win shares per 48

minutes because many of the observations of this variable are negative, and taking the natural log

would exclude all of the negative values from my regression. I also kept minutes played in linear

form since Simmons and Berri’s model used a linear functional form, and I did not take the

natural log of effective field goal percentage for coefficient interpretation reasons (Simmons and

Berri 385). Three pointers made was not one of the variables that Hofler and Payne used in their

model, and I decided to leave it untransformed in my model because taking its natural log would

have dropped over 3,600 observations from my regression. For the variable of age, I chose to

incorporate a polynomial functional form by squaring the variable since Simmons and Berri did

the same in their model (385). This was logical because you would expect a player to perform

better as he gains more experience in the league and then to see his performance decline after he

reaches a point where his body can no longer take the toll it could when he was younger. To take

into account different players being more prone to record certain statistics over others because of

the nature and demands of their position, I generated a dummy variable for position played. I

also created a dummy variable for team to account for players playing on teams with differing

levels of success. Finally, I decided to restrict the data I used in my regression to players who

played five or more minutes per game in order to exclude observations that could generate

abnormally high or abnormally low values for my dependent variable on account of a player

recording only a couple minutes of play for an entire season. I wanted to focus on the players

who held genuine roles for their teams on a regular basis, not the ones who might only see one

minute of action if their team was winning or losing by an extreme margin. The summary

statistics for all of the variables in my regression besides my two dummy variables can be found

in Figure 1 of the Appendix.
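
The transformations and the sample restriction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STATA code behind the regression: the function name build_regressors, the dictionary layout, and the sample values are hypothetical, while the variable names follow the variable table. The sketch shows why taking a natural log removes any observation with a zero or negative value, which is the reason given for leaving three_p and ws48 in levels.

```python
import math

# Per game statistics that enter the model in natural-log form (per the text).
LOGGED = ["ft", "orb", "drb", "ast", "stl", "tov", "pts"]
# Statistics kept in levels (per the text).
LINEAR = ["mp", "three_p", "efg_per", "blk"]
MIN_MINUTES = 5.0  # sample restriction: five or more minutes per game


def build_regressors(row):
    """Return the transformed regressors for one player-season, or None
    if the observation would be excluded from the regression.

    `row` is a hypothetical dict keyed by the names in the variable table.
    """
    if row["mp"] < MIN_MINUTES:
        return None  # excluded by the minutes restriction
    out = {"age": row["age"], "age_sq": row["age"] ** 2}
    for name in LINEAR:
        out[name] = row[name]
    for name in LOGGED:
        if row[name] <= 0:
            return None  # ln(x) is undefined for x <= 0, so the row is lost
        out["log_" + name] = math.log(row[name])
    return out


# Made-up player-season, for illustration only.
example = {"age": 27, "mp": 31.5, "three_p": 0.0, "efg_per": 0.52, "blk": 0.4,
           "ft": 3.1, "orb": 1.2, "drb": 4.8, "ast": 2.9, "stl": 1.0,
           "tov": 1.9, "pts": 14.6}
regressors = build_regressors(example)
```

In this sketch a player with zero three-pointers per game stays in the sample because three_p is left untransformed, whereas a player with zero assists per game would be dropped by the log transform.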

Econometric Analysis

In order to address my research question and determine which statistics matter in

assessing a player’s individual productivity as measured by win shares per 48 minutes of play, I

had to conduct econometric analysis on my model using STATA. Because my data contains a

cross-sectional element in the form of all different NBA players who played in a particular

season and a time-series element, the past 20 complete NBA seasons, I specified my data as a

panel data set. Since panel data cannot contain duplicates of the same cross-sectional unit with the

same time-series value, I had to edit a few players in my data set by adding an A to

their last names in order to differentiate them from other players who shared the same name so

that STATA would know that no player was listed more than once for any particular season. To

start to try to answer my research question, I ran an ordinary least squares regression (OLS) on

my model, and the output from this regression can be found in Figure 2.
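
Written out, the specification described in the Economic Model section would look roughly like the following. The coefficient symbols (beta, gamma, delta), the player and season subscripts i and t, the dummy indices p and k, and the error term epsilon are notation assumed here for illustration; the variable names come from the variable table.

```latex
\begin{aligned}
\text{ws48}_{it} ={}& \beta_0 + \beta_1\,\text{age}_{it} + \beta_2\,\text{age}_{it}^{2}
  + \beta_3\,\text{mp}_{it} + \beta_4\,\text{three\_p}_{it} + \beta_5\,\text{efg\_per}_{it} \\
 &+ \beta_6 \ln(\text{ft}_{it}) + \beta_7 \ln(\text{orb}_{it}) + \beta_8 \ln(\text{drb}_{it})
  + \beta_9 \ln(\text{ast}_{it}) + \beta_{10} \ln(\text{stl}_{it}) \\
 &+ \beta_{11} \ln(\text{tov}_{it}) + \beta_{12}\,\text{blk}_{it} + \beta_{13} \ln(\text{pts}_{it})
  + \sum_{p} \gamma_p\,\text{posit}_{p,it} + \sum_{k} \delta_k\,\text{team}_{k,it} + \varepsilon_{it}
\end{aligned}
```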

To check for potential problems with the specification of my model, I started by checking

for the presence of multicollinearity. Although my model did not show any signs of perfect

multicollinearity since STATA ran my regression without dropping any of my explanatory

variables, imperfect multicollinearity could still pose a problem. The presence of imperfect

multicollinearity could result in increased variance of my model’s estimated beta coefficients and

could lead to lower t-statistics, which would make hypothesis testing less reliable. To test my

model for the presence of imperfect multicollinearity, I began by generating a correlation matrix

of the variables in my model (Figure 3). Most of the correlation coefficients between the

independent variables are below 0.8, the level that could indicate the presence of

multicollinearity, but there are a few coefficients in the matrix that exceed this value. The most

extreme of these values is the coefficient between age and age squared, which was to be

expected because age squared is a polynomial function of age. The presence of a few high correlation

coefficient values means that some of my explanatory variables are correlated with each other,

but it does not necessarily mean that multicollinearity is a problem. To further assess the

potential issue of multicollinearity, I generated a variance inflation factor (VIF) for each of

the explanatory variables in my model (Figure 4). Most of the VIFs are either below or a little

bit above 5, the value above which a multicollinearity problem could be indicated, with the

mean VIF around 6. Again, the VIF values for age and age squared stood out as expected,

and these two values skewed the mean VIF toward a larger number. Although

several VIF values are above 5, most of the values are relatively low and there does not seem to

be an indication that multicollinearity is harmful to the model. I feel safe not trying to do

anything about possible indicators of multicollinearity because omitted variable bias could be

introduced in doing so.
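
A small Python sketch can illustrate why age and age squared are expected to stand out in both the correlation matrix and the VIF output. The age range below is hypothetical and is not the data behind Figures 3 and 4, and the function names are introduced here only for illustration. The sketch relies on the definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one explanatory variable on the others; with a single other regressor, that R^2 equals the squared pairwise correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_regressors(x, y):
    """VIF = 1 / (1 - R^2). With only one other regressor, the auxiliary
    R^2 is the squared pairwise correlation between the two."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical ages, one observation per age; not the paper's data.
ages = list(range(19, 41))
ages_sq = [a * a for a in ages]

r_age = pearson(ages, ages_sq)               # very close to 1
vif_age = vif_two_regressors(ages, ages_sq)  # far above the threshold of 5
```

Because every age in a realistic range is positive, age squared rises almost in lockstep with age, so a very high correlation and VIF for this pair reflect the functional form rather than a flaw in the data.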

Next I checked for the presence of serial correlation, which could create the problem of

biasing t-statistics upward and making hypothesis testing less reliable. I first generated a scatter

plot of the residuals of my model against my time variable year to try to assess whether or not

serial correlation could be an issue (Figure 5). There was no obvious evidence of serial

correlation from the scatter plot, so I calculated a Durbin-Watson statistic to check for the

potential problem. The Durbin-Watson statistic that STATA generated came out to be about 1.46,

which was less than the lower bound of 1.53, meaning that I can reject the null hypothesis

of no positive serial correlation. Because

this value was not very far below the lower bound, the extent of serial correlation is probably not

severe, but it still needs to be addressed. I decided to address the issue by correcting the standard errors,

running Newey-West standard errors on a generalized least squares (GLS) regression that I

generated for my model (Figure 6). Because my data set is so large, Newey-West should work

well in treating the issue of serial correlation by adjusting the standard errors upward to account for

their being too low as a result of serial correlation.
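
The Durbin-Watson decision described above can be sketched as follows. The residual series are made up for illustration, the function names are hypothetical, and the sketch treats the residuals as one ordered sequence, which is a simplifying assumption rather than a description of how STATA handled the panel. Only the statistic of 1.46 and the lower bound of 1.53 come from the text.

```python
def durbin_watson(residuals):
    """d = sum over t >= 2 of (e_t - e_{t-1})^2, divided by the sum of e_t^2.

    Values near 2 suggest no first-order serial correlation; values well
    below 2 suggest positive serial correlation.
    """
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def rejects_no_positive_serial_correlation(d, lower_bound):
    """Reject the null of no positive serial correlation when d falls
    below the lower critical bound."""
    return d < lower_bound


# Made-up residuals that drift slowly (positive serial correlation).
drifting = [1.0, 0.8, 0.9, 0.5, -0.2, -0.6, -0.9, -0.7, -0.1, 0.4]
# Made-up residuals that flip sign every period (negative serial correlation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_drifting = durbin_watson(drifting)
d_alternating = durbin_watson(alternating)

# Values quoted in the text: statistic of about 1.46, lower bound of 1.53.
paper_rejects = rejects_no_positive_serial_correlation(1.46, 1.53)
```

The drifting series produces a statistic far below 2 and the alternating series one far above it, which brackets the value of 1.46 reported above as a mild case of positive serial correlation.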

Lastly, I checked my model for the issue of heteroskedasticity, which could result in the

standard errors in my model being biased downward and making hypothesis testing unreliable

because of upward biased t-statistics. Since I did not know where heteroskedasticity could likely

come from, I performed a White Test to search for heteroskedasticity from any source. STATA

produced a very high test statistic, which allowed me to reject the null hypothesis

of homoskedasticity and indicated that heteroskedasticity was present in my model. In order to

try to treat this problem and because of the large size of my data set, I chose to fix the standard

errors by creating robust standard errors that would inflate the standard errors by differing

amounts based on the level of heteroskedasticity. I ran an OLS regression with robust standard

errors that also stayed consistent with the treatment of the serial correlation issue (Figure 7). The

issue of heteroskedasticity does not bias beta coefficients, so I can confidently interpret these

values in my model.
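
The point that robust standard errors change the standard errors but not the beta coefficients can be illustrated with a one-regressor sketch. The data below are made up so that the residuals are largest at the extremes of x, and the function names are hypothetical. The robust formula shown is the basic White (HC0) estimator for a single slope; statistical packages often apply an additional small-sample scaling, so this is not a reproduction of the output in Figure 7.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return slope, intercept, resid


def slope_standard_errors(x, resid):
    """Return (conventional, robust) standard errors of the slope.

    conventional: sqrt(s^2 / Sxx), with s^2 = SSR / (n - 2)
    robust (HC0): sqrt(sum((x_i - xbar)^2 * e_i^2) / Sxx^2)
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    s2 = sum(e * e for e in resid) / (n - 2)
    conventional = math.sqrt(s2 / sxx)
    meat = sum(((xi - xbar) ** 2) * e * e for xi, e in zip(x, resid))
    robust = math.sqrt(meat / sxx ** 2)
    return conventional, robust


# Made-up data: the spread of y around the fitted line grows toward the
# extremes of x, which is one form of heteroskedasticity.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [-2, -5, -2, 1, 2, 3, 10]

slope, intercept, resid = simple_ols(x, y)
se_conventional, se_robust = slope_standard_errors(x, resid)
```

The slope estimate is the same whichever standard error is reported; only the denominator of the t-statistic changes. In this constructed example the robust standard error is the larger of the two, though in general it can be larger or smaller than the conventional one.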

After addressing the potential issue of multicollinearity and treating my regression for the

problems of serial correlation and heteroskedasticity, my model is correctly specified and I can

confidently use it for interpretation. To try to answer my research question of which statistics

matter in determining a player’s productivity in the NBA, I will run a two-sided test at the 5 percent level

of significance on the independent variables of my model. I will compare the t-statistics of

the variables in my model to the critical t-value of 1.96 to see if I reject or fail to reject my null

hypothesis given below:

H0: explanatory variable does not have an effect on win shares per 48 minutes

HA: explanatory variable does have an effect on win shares per 48 minutes

VARIABLE    t-STATISTIC    REJECT/FAIL TO REJECT
age             1.70       Fail to reject
age_sq         -0.80       Fail to reject
mp             -6.94       Reject
three_p         4.87       Reject
efg_per        43.83       Reject
log_ft         33.36       Reject
log_orb        12.72       Reject
log_drb         9.06       Reject
log_ast        18.35       Reject
log_stl        11.71       Reject
log_tov       -42.04       Reject
blk             8.40       Reject
log_pts        -4.51       Reject

Therefore, with 95 percent confidence I can say that I fail to reject the null hypothesis that a

player’s age does not have an effect on his win shares per 48 minutes of play, but I do reject the

null hypotheses that minutes played, three pointers made, effective field goal percentage, free

throws made, offensive rebounds, defensive rebounds, assists, steals, turnovers, blocks, and

points scored per game do not have an effect on win shares per 48 minutes of play. I did not

include the dummy variables for position and team in my chart above because I included

them in my regression not to analyze how position played and team played on affect a

player’s performance, but to control for position and team effects on

the variables that I was trying to analyze for effect on player performance.
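
The decision rule behind the chart above can be sketched in Python. The t-statistics and the critical value of 1.96 are taken from the text; the function and variable names are hypothetical and chosen only for illustration.

```python
CRITICAL_T = 1.96  # two-sided critical value quoted in the text

# t-statistics as reported in the chart above.
T_STATISTICS = {
    "age": 1.70, "age_sq": -0.80, "mp": -6.94, "three_p": 4.87,
    "efg_per": 43.83, "log_ft": 33.36, "log_orb": 12.72, "log_drb": 9.06,
    "log_ast": 18.35, "log_stl": 11.71, "log_tov": -42.04, "blk": 8.40,
    "log_pts": -4.51,
}


def decide(t_stat, critical=CRITICAL_T):
    """Two-sided test: reject the null when |t| exceeds the critical value."""
    return "Reject" if abs(t_stat) > critical else "Fail to reject"


decisions = {name: decide(t) for name, t in T_STATISTICS.items()}
not_rejected = sorted(name for name, d in decisions.items()
                      if d == "Fail to reject")
```

Because the test is two-sided, the sign of a t-statistic does not affect the decision; only its magnitude relative to 1.96 matters, which is why large negative values such as those on log_tov and mp lead to rejection.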

Analysis Description

After conducting my econometric analysis, I was able to finally answer my research

question and see which statistics matter in determining a player’s productivity in the NBA. Most

of the signs of the beta coefficients and the t-values came out to be as expected with the majority

of the statistics having positive effects on win shares per 48 minutes and with turnovers per game

having negative effects, but it was somewhat surprising to see that the coefficient on points per

game was negative. A one percent increase in points scored per game is associated with about a

0.01 unit decrease in win shares per 48 minutes of play. A possible reason for this negative

relationship is the high correlation between points per game and minutes

played per game (see Figure 3). Minutes played per game also had a negative value for its

predicted beta coefficient, and the magnitude of its t-statistic is higher than that of points per

game. Win shares per 48 minutes might decrease as points per game increases partly because of

the negative relationship between win shares per 48 minutes and minutes played per game, but it could also be due to other factors.

For example, players who average a high number of points per game may be more prone to having

lower efficiency numbers, such as shooting percentages, because of their high volume of shots than

players who take only a small number of shots. When looking at some of the most significant

variables in the equation, it is not surprising to see turnovers per game having a negative t-

statistic with such a high magnitude or to see that effective field goal percentage has the highest

positive t-statistic in the regression. However, it is surprising that the t-statistic was so high on

free throws made per game, considering that the coefficient of points per game was negative and

that there was a predictably high correlation between free throws and points (see Figure 3). A

one percent increase in free throws made per game is associated with about a 0.05 unit increase

in win shares per 48 minutes of play. This seems to imply that free throws are the most effective

way that a player can score, and this makes logical sense considering that when a player gets

fouled it hurts the opposing team both by potentially giving away points and putting players in

foul trouble. The fact that losing teams often try to foul late in games to try to stop the clock and

keep the game from ending could be another reason why free throws made has one of the largest

effects on win shares per 48 minutes of all the variables. Although I treated the problems of

serial correlation and heteroskedasticity in my econometric analysis, my model is still subject to

shortcomings. Problems could have arisen through omitted variable bias if I failed to

include variables in my regression that matter in determining productivity, or my decision not to

take the natural log of my dependent variable for fear of losing observations from my regression

could have created an issue with my functional form.
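
The one percent interpretations given earlier in this section rest on the fact that several regressors enter in natural-log form while win shares per 48 minutes stays in levels. The usual way to read a coefficient in that setting can be sketched as follows. The symbols beta_j for the coefficient, x_j for the underlying per game statistic, and p for the percentage change are notation assumed here, and all other regressors are held fixed.

```latex
\begin{aligned}
\text{ws48} &= \dots + \beta_j \ln(x_j) + \dots \\
\Delta\,\text{ws48} &= \beta_j \left[ \ln\!\big( x_j (1 + p/100) \big) - \ln(x_j) \right]
  = \beta_j \ln(1 + p/100) \approx \frac{\beta_j}{100}\, p
\end{aligned}
```

Under this reading, a p percent increase in x_j is associated with a change of roughly beta_j times p divided by 100 units in win shares per 48 minutes, an approximation that is accurate when p is small.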

Despite the potential shortcomings of my model, the results of my analysis can help

NBA teams make better decisions on which players to pursue and which ones to avoid. When it

is not obvious what contributions a particular player can make to help a team

succeed, teams could look at that player’s effective field goal percentage, free throws made per

game, turnovers committed per game, and even assists per game as potential indicators of how

productive he might be in helping the team win. My model has laid a good groundwork for

further research in trying to find the statistics that matter the most in assessing individual

performance. There will never be a model that is a perfect indicator of how productive a player

is, but models like mine that carry room to be constantly improved upon could potentially change

traditional notions of how players are evaluated in the NBA.

Sources Referenced

Basketball-Reference. Sports Reference LLC. 2016. Web. 24 Mar. 2016.

Berri, David J. “Who Is ‘Most Valuable’? Measuring the Player’s Production of Wins in the

National Basketball Association.” Managerial and Decision Economics 20.8 (1999):

411-427. Web. 12 Apr. 2016.

Hofler, Richard A., and James E. Payne. “Measuring Efficiency in the National Basketball

Association.” Economics Letters 55.2 (1997): 293-299. Web. 22 Mar. 2016.

Kubatko, Justin, et al. “A Starting Point for Analyzing Basketball Statistics.” Journal of

Quantitative Analysis in Sports 3.3 (2007): 1-22. Web. 22 Mar. 2016.

Simmons, Rob, and David J. Berri. “Mixing the Princes and the Paupers: Pay and Performance

in the National Basketball Association.” Labour Economics 18.3 (2011): 381-388. Web.

21 Mar. 2016.

Appendix

Figure 1: Summary statistics for the variables in OLS regression

Figure 2: OLS regression output with dummy variables for position and team

Figure 3: Correlation matrix for variables in OLS regression

Figure 4: Variance Inflation Factor (VIF) output for variables in OLS regression

Figure 5: Scatter plot of residual against year (NBA season)

Figure 6: GLS regression output with Newey-West standard errors

Figure 7: OLS regression output with robust standard errors
