
Predicting the Outcome of a Basketball Game for Knowledge and Profit

Jason Kholodnov


Table of Contents

1. Project Purpose

2. Acquiring Data from Internet Sources

3. Determining Strength of Teams

4. 2014 NBA Playoff Predictions

5. 2015 NBA Playoffs Predictions

6. Analyzing the Impact of Money on Performance

7. Analyzing Elo's Failures

8. Determining Players' Performances

9. Using Player Statistics to Categorize Teams

10. What's Next

11. Technologies Used


Project Purpose

The purpose of this project is to create a platform which analyzes statistics of professional basketball players and teams and, by doing so, predicts the outcome of a future game. To do this, a platform with three major components was developed in order to Scrape, Develop, and Analyze data from ten seasons' worth of NBA games.


Acquiring Data from Internet Sources

To begin this project, a database of game statistics first needed to be generated. ESPN.com was selected due to its uniform formatting across teams and games. A web-scraping1 component was developed in Python 3, which utilizes BeautifulSoup 4 to generate a parse tree for the HTML content. In order to scrape all seasons' worth of data, a recursive scraping algorithm was developed, which functions like so:

1. Scrape ESPN.go.com/nba/standings to acquire the URL of each team.
2. Scrape all of the previously acquired team URLs in parallel to acquire the URLs of each game.
3. Scrape all of the previously acquired game URLs in parallel to acquire the data from each game.
4. Store the data acquired by the scraper threads in a relational SQLite3 database.
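The first step of this pipeline can be sketched as follows. The HTML snippet and link structure below are illustrative stand-ins for ESPN's actual markup, and Python's built-in html.parser is used in place of BeautifulSoup only to keep the sketch dependency-free; the idea (walk the parse tree, keep team links) is the same.

```python
from html.parser import HTMLParser

# Illustrative stand-in for the standings page markup (not ESPN's real HTML).
STANDINGS_HTML = """
<table class="standings">
  <tr><td><a href="/nba/team/_/name/sa">San Antonio Spurs</a></td></tr>
  <tr><td><a href="/nba/team/_/name/okc">Oklahoma City Thunder</a></td></tr>
  <tr><td><a href="/about/contact">Contact</a></td></tr>
</table>
"""

class TeamLinkScraper(HTMLParser):
    """Collects hrefs that point at team homepages."""
    def __init__(self):
        super().__init__()
        self.team_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if "/nba/team/" in href:  # keep only team links, drop site chrome
            self.team_urls.append(href)

scraper = TeamLinkScraper()
scraper.feed(STANDINGS_HTML)
print(scraper.team_urls)  # ['/nba/team/_/name/sa', '/nba/team/_/name/okc']
```

The same pattern repeats one level down: each team page yields game URLs, and each game page yields box-score data.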

Several problems were encountered while developing this program, the first and foremost being a lack of uniformity in some teams' homepage formatting. The most problematic team happens to be the Charlotte Bobcats, or as they are now known, the Charlotte Hornets. Due to the team name change this past season, all of the hyperlinks on ESPN were incorrect and did not appear within the parse tree. To solve this, a page-parsing check was developed to determine whether either of the teams in the game being parsed was "Charlotte"; if so, several values were hardcoded to allow the rest of the scraping component to function correctly.

The second most complicated issue was multi-threaded writes to the SQLite3 database. Because the program launches upwards of 14,000 game-scraping threads at a time (in a pipelined architecture), the database writes experienced a large amount of contention, causing the database to lock. The solution to this problem required just two steps, which work in the following way:

If a thread attempts a database write but receives a DATABASE LOCKED error, the thread recursively calls the same function with the same parameters. If this sequence repeats more than two times, the thread exits and stores the game ID in an error message.

If any threads exited with an error, the program reports how many games could not be scraped. The program then has to be run again in order to attempt to scrape those games.

This method of error handling allows for "Eventual Validity" of the database. Although one pass of the program will not achieve an entirely correct database with all games scraped, the set of games still needing to be scraped shrinks with every successful run. In order to scrape one season's worth of games, a total of 2-3 runs is required.
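The retry-on-lock logic can be sketched as follows. The table schema and function name are hypothetical, but the locked-database error and the two-retry limit match the description above.

```python
import sqlite3

MAX_RETRIES = 2
failed_game_ids = []  # games to re-attempt on the next run of the program

def store_game(conn, game_id, home_score, away_score, depth=0):
    """Write one game's stats, retrying recursively if the database is locked."""
    try:
        conn.execute(
            "INSERT INTO games (game_id, home_score, away_score) VALUES (?, ?, ?)",
            (game_id, home_score, away_score),
        )
        conn.commit()
    except sqlite3.OperationalError:  # raised as "database is locked"
        if depth >= MAX_RETRIES:
            failed_game_ids.append(game_id)  # give up; scrape on a later run
            return
        store_game(conn, game_id, home_score, away_score, depth + 1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT, home_score INT, away_score INT)")
store_game(conn, "2014-06-15-SA-MIA", 104, 87)
```

Anything left in failed_game_ids at the end of a run is what the next pass of the scraper picks up, which is the "Eventual Validity" behavior.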

1- A technique of extracting information from websites


Determining Strength of Teams

The second component in this project is a team strength generator. In order to determine how important the outcome of a previous game was, it is necessary to have some metric by which we can rank teams. To do this, an Elo ranking system2 was adopted. Elo is a metric developed by Arpad Elo, a Hungarian physicist. This ranking system functions very well when highly ranked individuals compete against low-ranked individuals, since the rating updates scale according to the difference in rankings.

The formula reads as follows:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

E_B = 1 / (1 + 10^((R_A - R_B) / 400))

where E_x is the expected chance that team x will be the victor, and R_A and R_B denote the current ratings of Team A and Team B respectively.

To calculate the rating change that occurs in the event of a victory for team x, we use the following:

R_0 = 1500

R_x^(n) = R_x^(n-1) + 32 * (W - E_x)

where W is a binary value: 1 = win, 0 = loss.

For example:

• 1500 rating vs. 1500 rating: the victor will gain 25 points, the loser will lose 25 points.
• 1600 rating vs. 1400 rating: if the 1400 wins, ~40 points will change hands, while if the 1600 wins, only ~15 points will change hands.
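The Elo expectation and update rules can be sketched directly. The K-factor is left as a parameter here, since it governs how many points change hands per game; this is a sketch, not the paper's C++ implementation.

```python
def expected_score(r_a, r_b):
    """Expected chance that the team rated r_a beats the team rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(r_x, expected, won, k=32):
    """New rating after one game; won is 1 for a win, 0 for a loss."""
    return r_x + k * (won - expected)

# Evenly matched teams: each side is expected to win half the time.
e = expected_score(1500, 1500)
print(e)                   # 0.5
print(update(1500, e, 1))  # winner's new rating: 1516.0
```

Note the zero-sum property: whatever the winner gains, the loser loses, because E_A + E_B = 1.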

In order to implement this component, each team was given an initial rating of 1500, and an algorithm was developed which functions in the following steps:

1. Select all days on which any games were played.
2. Select all games which were played on each day.
3. For each day on which games were played, spawn an appropriate number of threads so that each game's information has its own worker thread.
4. Update the database to reflect the teams' Elo ratings.

2- http://en.wikipedia.org/wiki/Elo_rating_system


This implementation method solves one key problem: once all ten seasons were scraped, upwards of 13,000 games needed Elo ratings generated. Performed sequentially, this would take an expected 200 minutes. By decomposing the problem set into the days on which games were played, we reduce it into smaller parallelizable chunks and cut the runtime to 45 minutes on a quad-core machine.
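The day-level decomposition can be sketched with a thread pool. The schedule and teams below are made-up stand-ins, and the paper's actual generator was written in C++ against the database; the structural point is that days must be processed in order (each day's updates depend on the ratings left by the previous day), while the games within one day are independent and can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in schedule: each day maps to the games played on it (winner first).
schedule = {
    "2014-01-01": [("sa", "okc"), ("mia", "lac")],
    "2014-01-02": [("ind", "hou")],
}

ratings = {t: 1500.0 for t in ["sa", "okc", "mia", "lac", "ind", "hou"]}

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def process_game(game, k=32):
    """Worker: apply the Elo update for one game."""
    winner, loser = game
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Days run sequentially; the games inside a day run in parallel. (A real
# implementation would lock each team's rating before mutating it.)
for day in sorted(schedule):
    with ThreadPoolExecutor(max_workers=len(schedule[day])) as pool:
        list(pool.map(process_game, schedule[day]))
```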

Analyzing the Elo output, we see an interesting trend each season. Throughout the season, each team's maximum Elo rating was recorded, and at the end of each season the sixteen teams which made the playoffs were put in order. Here is an example of the output from the 2013-2014 season:

Team Abbreviation   Max Elo Achieved
sa                  1827.277
okc                 1776.423
mia                 1755.340
lac                 1751.937
ind                 1746.647
hou                 1739.535
por                 1735.329
gsw                 1696.646
mem                 1686.782
bkn                 1686.189
chi                 1668.387
dal                 1649.538
tor                 1647.060
cha                 1623.499
was                 1588.494
atl                 1568.493


2014 NBA Playoff Predictions

By comparing the maximum Elo rating each team achieved throughout the regular season, we are able to make predictions of the outcomes of the playoff brackets.

RO16:
SA > DAL (correct)
HOU > POR (incorrect)
LAC > GSW (correct)
OKC > MEM (correct)
IND > ATL (correct)
CHI > WAS (correct)
BKN > TOR (correct)
MIA > CHA (correct)

RO8:
SA > POR (correct)
OKC > LAC (correct)
IND > WAS (correct)
MIA > BKN (correct)

RO4:
MIA > IND (correct)
SA > OKC (correct)

Finals:
SA > MIA

Using these predictions, we achieve an accuracy of 15/16 series predicted correctly. By using the maximum Elo rating each team achieved during the season, we are effectively measuring the team's peak performance. Because the playoffs are so high-stakes, each team is expected to perform at or near its peak. Although any team may win an individual game, over a seven-game series the chance that the stronger team wins four games is much higher. The only case in which maximum Elo did not accurately predict the victor of a series was Houston versus Portland: although Houston had the greater Elo rating, the difference was negligible, leading to an incredibly close six-game series in which three games went to overtime and the average spread was 4.7 points.
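The series-versus-single-game effect can be made concrete: if the stronger team wins any one game with probability p, the chance it takes four games in a best-of-seven follows from the binomial distribution (games treated as independent; the p value below is illustrative).

```python
from math import comb

def series_win_prob(p, wins_needed=4):
    """Chance the favorite clinches a best-of-(2*wins_needed - 1) series,
    given a per-game win probability p (games assumed independent)."""
    n = 2 * wins_needed - 1  # playing out all 7 games gives the same winner
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(wins_needed, n + 1))

# A 60% per-game favorite wins a seven-game series about 71% of the time.
print(round(series_win_prob(0.6), 3))  # 0.71
```

This is why a modest per-game edge compounds into a much more reliable series prediction.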


2015 NBA Playoffs Predictions

Comparing this season's maximum Elo ratings with the previous season's, we see a large difference in team strengths.

2014 Playoffs        2015 Playoffs
sa   1827.277        lac  1741.23
okc  1776.423        mem  1729.13
mia  1755.340        atl  1718.02
lac  1751.937        hou  1716.41
ind  1746.647        gs   1702.32

We see that the peak strength of teams in the 2013-2014 season was significantly higher than the peak strength of teams in the 2014-2015 season. This implies that throughout the 2014-2015 season, no team was dominant for long stretches, and team strengths were more balanced.

By comparing the Elo differences in the 2014 playoffs to those we expect in the 2015 playoffs, we can expect a very interesting series between Memphis and Los Angeles in the semifinals, as well as a much closer Finals this year.


Analyzing the Impact of Money on Performance

(Data for 2015 season.)

An interesting comparison between teams arises when each team's total salary is compared to the maximum Elo rating that team achieved during a season. We see a tiered distribution of teams, where those which spent less than ~65M were not able to reach the peaks that additional money can bring. Two outliers appear in the data set: Atlanta and Brooklyn. Atlanta performed exceptionally well for a team with its salary, while Brooklyn performed very poorly for the team with the highest salary in the league. These results help explain the predicted brackets, and give us another piece of information about the teams.

The teams circled in red are those that made the playoffs. We see that 14 of the 18 teams which spent 70 million or more made the playoffs, while only 3 teams that spent less than 70 million were able to make the playoffs.


Analyzing Elo's Failures

In addition to Elo's effectiveness over a seven-game series, it is effective at predicting single games. By comparing the Elo ratings of the two teams in every game with the game's result, we are able to predict the victor with an accuracy of 65.6%. This is significantly lower than the 15/16 accuracy achieved over seven-game series, but still a very good result.

In order to improve our overall accuracy, we must look at the games in which a prediction using purely Elo did not prove to be accurate.

We see that the Elo differential is skewed heavily toward the lower end of the plot. This means that as the Elo differential between teams increases, the likelihood that the lower-rated team will prove victorious declines rapidly. This was to be expected from the definition of the Elo rating, but the graph allows us to develop a percent chance that a team will win based on Elo differential.

In order to analyze why the lower rated team was able to be victorious, we must move toward a lower level view than simply team ratings. Since a whole is just the sum of its parts, we will look at the performances of individual players in these games to determine what factors led to the underdog coming out on top.


Determining Players' Performances

Now that we have developed a ranking system for each team, we must determine how well each player played in each game, as well as how he performed against the strength of the opposing team. We must do so in order to develop a player-based simulation in which a team's performance equals the sum of its players' performances. To do this effectively, we must gauge a player's performance against his previous performances as well as against a standardized performance metric. The metrics we will use to measure player performance are:

• Performance Index Rating (PIR)3

• Normalized Performance Rating (NPR)4

Both of these metrics will use all measurable statistics from all games for which data has been tracked:

Minutes Played, Field Goals (M/A), Three Pointers (M/A), Free Throws (M/A), Offensive Rebounds, Defensive Rebounds, Assists, Steals, Turnovers, Points, Plus-Minus

In order to calculate the PIR value for each player in each game, we use this fairly simple formula:

PIR = (Points + 2*Rebounds + Assists + Steals + 2*Blocks + Fouls Drawn)
    - (Missed FGs + Missed FTs + Turnovers + Fouls + Shots Blocked)

This provides us with a generic way to determine a player's performance in a game, including both his offensive and defensive contributions. This metric can also be used quite reliably to determine a team's performance, by generating a Team Performance Rating (TPR) with the following formula:

TPR = Σ (i = 1 to num_players) PIR_i

When comparing the TPRs of the two teams within each game, we are able to achieve an incredible 90.6% accuracy in predicting the winner of a game. Although this value is incredibly accurate, the values from which it is calculated are intrinsically related to the outcome of the game. If we are able to predict individual players' PIRs with no knowledge of a future game, we will have an incredibly accurate method of determining individual players' performances as well as the outcome of any individual game. We will return to this TPR statistic when we begin to generate simulations for each player.
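The two box-score formulas above can be sketched directly; the dictionary keys below are illustrative names, not the database's actual column names.

```python
def pir(s):
    """Performance Index Rating for one player's box score (a dict of stats)."""
    positive = (s["points"] + 2 * s["rebounds"] + s["assists"] + s["steals"]
                + 2 * s["blocks"] + s["fouls_drawn"])
    negative = (s["missed_fg"] + s["missed_ft"] + s["turnovers"]
                + s["fouls"] + s["shots_blocked"])
    return positive - negative

def tpr(box_scores):
    """Team Performance Rating: the sum of every player's PIR."""
    return sum(pir(s) for s in box_scores)

sample = {"points": 20, "rebounds": 8, "assists": 5, "steals": 2, "blocks": 1,
          "fouls_drawn": 3, "missed_fg": 7, "missed_ft": 2, "turnovers": 3,
          "fouls": 2, "shots_blocked": 1}
print(pir(sample))  # (20 + 16 + 5 + 2 + 2 + 3) - (7 + 2 + 3 + 2 + 1) = 33
```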

3- Performance Index Rating (PIR) is a basketball statistical formula used in a variety of European basketball leagues. It is similar but not identical to the Efficiency rating used by the NBA.
4- Not related to National Public Radio.


To compute the NPR values for each player within each game, we must develop an algorithm which functions in the following way:

1. Create a player object for each player in the league.
2. Go through each game in order.
   ◦ For each game, attribute each player's performance to the appropriate player.
   ◦ Using the player's μ and σ values up to the game being measured, determine how many standard deviations from his mean the player's performance in each statistic was.
   ▪ The NPR value for this game will be:

NPR = Σ (i = 1 to num_variables) (Performance_i - μ_i) / σ_i

By doing so, we can go through each player's game performances sequentially and determine the mean and standard deviation of each metric at the time any given game is played. We can then compare the player's performance in that particular game to determine where along his standard curve it falls.

This NPR value tells us how well a player performed relative to his average. A positive value implies that he performed better than an average game for him, while a negative value implies the opposite. We will use this value to estimate a player's momentum. For example, if a player has the following NPR values for his previous 5 games:

• 0.5
• 1.6
• 6.4
• 16.4
• 15.8

We would infer that the player has been "heating up," and predict that his hot streak will continue through his next game. It is important to note that some players may generate inflated NPR values simply by having very few minutes played.

• Example: A player averaging 1 minute played and 0 in all statistics, who played 5 minutes in one game and managed to score a few points, would achieve astronomical NPR values which may not even be achievable by LeBron James.

We will account for this effect by multiplying the NPR value by the percentage of minutes played within a game. This will scale down a player's performance to account for their actual game contribution, while still displaying their personal momentum.

Individually, the NPR statistic does not tell us much about the outcome of a game, since it simply projects a player's performance against his previous performances; but it does allow us to estimate with some certainty how well the player has been performing recently, and thus generate 5- and 10-game moving momentum values.
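The running z-score computation, together with the minutes-played scaling, can be sketched as follows. The statistics module's mean and pstdev stand in for the running μ and σ bookkeeping, and the stat names are illustrative.

```python
from statistics import mean, pstdev

def npr(history, game, minutes_frac):
    """Normalized Performance Rating for one game.

    history: list of the player's past games, each a dict of tracked stats.
    game: this game's statistics.
    minutes_frac: fraction of the game's minutes the player was on the floor,
                  used to scale down low-minute outliers.
    """
    total = 0.0
    for stat in game:
        past = [g[stat] for g in history]
        mu, sigma = mean(past), pstdev(past)
        if sigma > 0:  # skip stats with no variance so far
            total += (game[stat] - mu) / sigma
    return total * minutes_frac

history = [{"points": 10, "rebounds": 4},
           {"points": 14, "rebounds": 6},
           {"points": 12, "rebounds": 5}]
# A game well above this player's averages yields a positive NPR.
print(npr(history, {"points": 18, "rebounds": 5}, minutes_frac=0.75))
```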


Using Player Statistics to Categorize Teams

By using a team's player-level statistics, we can try to classify a team's game as a win or a loss. Using an SVM5 model like so:

svm(Win~fga+tpa+fta+oreb+dreb+assist+steal+block+turnover+fouls+npr)

Note that all information about points scored is removed in order to prevent direct influence on the SVM model. Using this model, we can make predictions for the teams within our data set, and output the result as a table showing where the SVM proved correct and where it failed.

> model = svm(Win ~ fga + tpa + fta + oreb + dreb + assist + steal + block + turnover + fouls + npr, data = overall)
> pred = predict(model, overall)
> table(pred = round(pred), true = overall$Win)

        true
pred      0    1
   0    776  173
   1    122  725

The diagonal figures in this table are the important output of the SVM. They tell us that our SVM model places 776/949 losing teams correctly, and 725/847 winning teams correctly. This means that if we can develop a model which predicts a team's performance in the tracked variables,

FGA, TPA, FTA, OREB, DREB, ASSIST, STEAL, BLOCK, TURNOVER, FOULS, NPR

then we will be able to place that point in N-dimensional space on either side of a separating hyperplane. Based on which side of the hyperplane the point lands, we predict whether the team wins or loses the simulated game. We can now use this model to predict a team's chances of winning a game, based on simulated values for each player.
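The decision rule described above (which side of the learned hyperplane a simulated stat line lands on) can be sketched in pure Python. The weight vector and bias here are made-up stand-ins for the parameters the trained SVM actually learns.

```python
def predict_win(weights, bias, performance):
    """Classify a simulated stat line by which side of the hyperplane it falls on.

    weights/bias: hypothetical stand-ins for the trained SVM's parameters.
    performance: one team's simulated values for the tracked variables.
    """
    score = sum(w * x for w, x in zip(weights, performance)) + bias
    return 1 if score > 0 else 0  # 1 = predicted win, 0 = predicted loss

# Toy 3-variable example (say oreb, assist, turnover): assists help,
# turnovers hurt.
weights = [0.2, 0.5, -0.8]
bias = -4.0
print(predict_win(weights, bias, [10, 25, 8]))  # strong stat line -> 1
```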

5- Support Vector Machines classify vectors in N-dimensional space based on a dependent variable.


What's Next

- Develop a simulation which tries to give a predicted value for each team's performance. Using the SVM, predict whether that performance classifies as a Win or a Loss.

- Dig deeper into Salary ~ Performance player-level statistics.

- Compare predictions against Las Vegas bookie odds, and determine approximate returns if I bet $100 on all predicted teams.


Technologies Used

SQLite3 : Lightweight database which has SQL syntax.

Git : Version Control.

Docker : Virtual Environment which contains all library dependencies, packages, data to run the program.

Python : Web Scraping.

C++ : Team Elo, NPR, PIR generator, Game simulations.

R : Statistics on gathered data.