TRANSFERRING LEAN SIX SIGMA AND DFSS DATA
SIMPLY AND EFFECTIVELY
“Baseball Analytics”
Baseball is the only field of endeavor where a man can succeed three times out of ten and be considered a good performer. ~Ted Williams
4th Annual Design for Six Sigma Conference
James M. Wasiloff
Cary Young
US Army TACOM LCMC
9 February 2009
Agenda
• Introduction of Baseball Analytics
• Descriptive statistics and graphical data analysis
• Hypothesis development and testing
• Analysis of Variance (ANOVA)
• Pearson Correlation Coefficient
• Simple Linear Regression
• Multiple Regression and Best Fit Model
• Predictive Models
• Statistical Process Control
• Next Steps / Application in Other Sports
Introduction
• Why the session…
– Better way to understand and teach LSS and DFSS Tools
– Can Money Spent = Wins?
– Keep it “Statistically Simple”
– Just the Beginning
Baseball quote…
The charm of baseball is that, dull as it may be on the field, it is endlessly fascinating as a rehash. ~Jim Murray
Test of Hypothesis
• Null Hypothesis:
$H_0: \mu_1 = \mu_2$
MLB example: Ho: Mean Batting Average of the NY Yankees from 2006-2008 equals the Mean Batting Average of the Tampa Bay Rays from 2006-2008
• Alternative Hypothesis:
$H_a: \mu_1 < \mu_2$ or $\mu_1 > \mu_2$
“They are not the same”
During my 18 years I came to bat almost 10,000 times. I struck out about 1,700 times and walked maybe 1,800 times. You figure a ballplayer will average about 500 at bats a season. That means I played seven years without ever hitting the ball. ~Mickey Mantle, 1970
Batting Stats
American League: Team Batting Averages
Team           2006   2007   2008
Baltimore      0.277  0.272  0.267
Boston         0.269  0.279  0.280
Chicago Sox    0.280  0.246  0.263
Cleveland      0.280  0.268  0.262
Detroit        0.274  0.287  0.271
Kansas City    0.271  0.261  0.269
LA Angels      0.274  0.284  0.268
Minnesota      0.287  0.264  0.279
NY Yankees     0.285  0.290  0.271
Oakland        0.260  0.256  0.242
Seattle        0.272  0.287  0.265
Tampa Bay      0.255  0.268  0.260
Texas          0.278  0.263  0.283
Toronto        0.284  0.259  0.264

National League: Team Batting Averages
Team           2006   2007   2008
Arizona        0.267  0.250  0.251
Atlanta        0.270  0.275  0.270
Chicago Cubs   0.268  0.271  0.278
Cincinnati     0.257  0.267  0.247
Colorado       0.270  0.280  0.263
Florida        0.264  0.267  0.254
Houston        0.255  0.260  0.263
LA Dodgers     0.276  0.275  0.264
Milwaukee      0.258  0.262  0.253
NY Mets        0.264  0.275  0.266
Philadelphia   0.267  0.274  0.255
Pittsburgh     0.263  0.263  0.258
San Diego      0.263  0.251  0.250
San Francisco  0.259  0.254  0.262
St. Louis      0.269  0.274  0.281
Washington     0.262  0.256  0.251
It ain't like football. You can't make up no trick plays. ~Yogi Berra
Test of Hypothesis
• Are the batting averages of the National League different from those of the American League?
• T-test
• Interpretation: “P Low, null must go – P High, null will fly”

Two-Sample T-Test and CI: AL, NL
     N   Mean     StDev    SE Mean
AL   14  0.27086  0.00772  0.0021
NL   16  0.26356  0.00704  0.0018

Difference = mu (AL) - mu (NL)
Estimate for difference: 0.007295
95% CI for difference: (0.001717, 0.012872)
T-Test of difference = 0 (vs not =): T-Value = 2.69 P-Value = 0.012
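The T-Value in the Minitab output can be checked by hand from the summary statistics alone. The sketch below assumes the unpooled (Welch) form of the two-sample t-test, which appears consistent with the reported figures; the function name `welch_t` is illustrative and not part of the original slides.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    v1 = sd1 ** 2 / n1          # squared standard error of sample 1
    v2 = sd2 ** 2 / n2          # squared standard error of sample 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics copied from the Minitab output (AL first, then NL).
t_value, df = welch_t(0.27086, 0.00772, 14, 0.26356, 0.00704, 16)
print(f"T-Value = {t_value:.2f}, df = {df:.1f}")
```

Since the reported P-Value of 0.012 is below the usual 0.05 threshold, the "P Low, null must go" rule rejects the hypothesis that the two leagues have the same mean batting average.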
Are Salaries Correlated to Team Performance?
• The trend is…
• Problem statement:– Will increasing player salaries lead to more
success?
Baseball was the major American sport in which money bought success. George Will, Moneyball
2008 MLB Salaries and Win Count
Team Total Salary Wins Team Total Salary Wins
NY Yankees $207,108,489 89 San Francisco $76,194,000 72
NY Mets $137,391,376 89 Milwaukee $74,687,499 90
Detroit $137,290,196 74 Cincinnati $74,117,695 74
Boston $133,220,112 95 San Diego $72,626,616 63
Chicago Sox $121,189,332 89 Colorado $68,655,500 74
LA Angels $118,825,333 100 Baltimore $66,806,249 68
LA Dodgers $118,188,536 84 Texas $66,312,326 79
Chicago Cubs $117,954,333 97 Arizona $66,202,712 82
Seattle $116,876,482 61 Kansas City $57,855,500 75
Atlanta $102,849,666 72 Minnesota $56,932,766 88
St. Louis $99,624,449 86 Washington $54,166,000 59
Toronto $97,001,500 86 Pittsburgh $48,689,783 67
Philadelphia $95,479,880 92 Oakland $47,167,126 75
Houston $88,930,414 86 Tampa Bay $43,422,997 97
Cleveland $78,970,066 81 Florida $22,650,000 84
Correlation Between Salary and Wins?
[Figure: Scatter Plot of Salary Versus Games Won - 2008. X-axis: Total Salary 2008 (0 to 200,000,000). Y-axis: Wins in 2008 (60 to 100).]
Pearson Correlation Coefficient Definition
Use these derivation formulae, or use this simple graphic?
[Figure: values of r]
Correlation Coefficient
• Graphic approximation… what do you think?
• Minitab results: Pearson correlation of Total Salary 2008 and Wins in 2008 = 0.323
• Interpretation of results
American League West in 2002 (“Moneyball” Data Set)
Team Wins Payroll
Oakland 103 $41,942,665
Anaheim 99 $62,757,041
Seattle 93 $86,084,710
Texas 73 $106,915,180
Pearson correlation of Wins and Payroll = -0.928
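For readers without Minitab, the coefficient for the four-team "Moneyball" data set can be reproduced directly from the definition of Pearson's r (covariance divided by the product of the standard deviations). This is a minimal sketch; the helper name `pearson_r` is an illustrative assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# AL West 2002: Oakland, Anaheim, Seattle, Texas (values from the table above).
wins = [103, 99, 93, 73]
payroll = [41_942_665, 62_757_041, 86_084_710, 106_915_180]

r = pearson_r(wins, payroll)
print(f"Pearson correlation of Wins and Payroll = {r:.3f}")
```

With only four teams the strongly negative value is a striking illustration rather than a general rule; the 30-team 2008 data gave a weak positive correlation of 0.323.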
ANOVA
• Null Hypothesis:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_n$
MLB example: Ho: Mean Batting Average of the NY Yankees equals the Mean Batting Average of the Tampa Bay Rays equals the Mean Batting Average of the NY Mets equals the Mean Batting Average of the …
• Alternative Hypothesis:
$H_a$: At least one $\mu_k$ is different from one other $\mu_k$
MLB example: At least one team has a Mean Batting Average different from all other teams
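As a small worked illustration of the ANOVA hypotheses, the sketch below computes a one-way F statistic for three of the teams named in the example, using their 2006-2008 batting averages from the Batting Stats tables. The choice of these three teams and the function name `one_way_anova_f` are assumptions made for illustration; the slides do not show this calculation.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual seasons around their own team mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

yankees = [0.285, 0.290, 0.271]
rays = [0.255, 0.268, 0.260]
mets = [0.264, 0.275, 0.266]

f_stat, df_b, df_w = one_way_anova_f([yankees, rays, mets])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the tabulated critical value for the same degrees of freedom would lead to rejecting the null hypothesis that all team means are equal.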
A baseball fan has the digestive apparatus of a billy goat. He can, and does, devour any set of diamond statistics with insatiable appetite and then nuzzles hungrily for more. ~Arthur Daley
Regression Analysis
• Is it possible to model and predict number of wins for a season based on statistical parameters?
• The initial simple linear regression model, 2002 data:
Team Wins Payroll
Oakland 103 $41,942,665
Anaheim 99 $62,757,041
Seattle 93 $86,084,710
Texas 73 $106,915,180
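A simple linear regression on these four data points can be sketched with ordinary least squares. The code below is an assumption about how such a fit might be reproduced outside Minitab; the slides do not give the fitted equation, and payroll is rescaled to millions of dollars purely for readability.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# AL West 2002 payroll in millions of dollars, and wins.
payroll_musd = [41.942665, 62.757041, 86.084710, 106.915180]
wins = [103, 99, 93, 73]

slope, intercept = fit_line(payroll_musd, wins)
print(f"Wins = {intercept:.1f} + ({slope:.3f}) * Payroll ($M)")
```

The slope is negative for this data set, matching the negative correlation reported earlier. With four points, the line describes this division in this season and should not be read as a general predictive model.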
Multiple Regression and Best Fit Model
• Regression studies the relationship between the mean value of a random variable and the corresponding values of one or more independent variables.
– A model for predicting one variable from another.
– A statistical analysis assessing the association between two variables.
• Regression analysis is a method of analysis that enables you to quantify the relationship between two or more variables (X) and (Y) by fitting a line or plane through all the points such that they are evenly distributed about the line or plane.
• Multiple regression is a method of determining the relationship between a continuous process output (Y) and several factors (Xs).
American League West in 2002 (“Moneyball” Data Set)
Team Wins Payroll
Oakland 103 $41,942,665
Anaheim 99 $62,757,041
Seattle 93 $86,084,710
Texas 73 $106,915,180
Exploratory Data Analysis
[Figure: Matrix Plot of Wins in 2008, Total Salary, Average Age, New, ... Panels: Wins in 2008, Walks, Total Salary 2008, Average Age, New, Saves, Runs (P).]
What does it mean?
Testing the Predictive Model
• Tigers 2008 data…
– Here is the predictive transfer function from Minitab:
Wins = 32.1 + 1.48 Average Age - 34.5 Team ERA + 154 Team Batting Average + 0.582 Saves (P) + 0.150 Runs (P) - 0.0202 Walks (P) - 0.0087 SO (P)
– Testing on 2008 Data:
• Actual win count = 74
• Predicted win count = 74.26
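The transfer function is a plain linear combination, so it is straightforward to evaluate in code. The sketch below copies the coefficients from the Minitab equation; the input values are hypothetical placeholders chosen only to show the mechanics, and they are NOT the Tigers' actual 2008 statistics, which the slides do not list. The meaning of "(P)" is also not defined in the slides, so the variable names are kept exactly as given.

```python
INTERCEPT = 32.1

# Coefficients copied from the Minitab transfer function.
COEFFS = {
    "Average Age": 1.48,
    "Team ERA": -34.5,
    "Team Batting Average": 154.0,
    "Saves (P)": 0.582,
    "Runs (P)": 0.150,
    "Walks (P)": -0.0202,
    "SO (P)": -0.0087,
}

def predict_wins(stats):
    """Evaluate the linear transfer function for one team-season."""
    return INTERCEPT + sum(COEFFS[name] * stats[name] for name in COEFFS)

# Hypothetical placeholder inputs for illustration only (not real team data).
example_stats = {
    "Average Age": 29.0,
    "Team ERA": 4.50,
    "Team Batting Average": 0.270,
    "Saves (P)": 40,
    "Runs (P)": 800,
    "Walks (P)": 550,
    "SO (P)": 1100,
}

print(f"Predicted wins = {predict_wins(example_stats):.2f}")
```

Each coefficient reads as the change in predicted wins per unit change in that input with the others held fixed; for example, one additional save adds 0.582 predicted wins.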
Statistical Process Control and Statistical Thinking
• “Statistical process control is the application of statistical methods to identify and control the special cause of variation in a process” – iSixSigma.com
• Statistical Thinking: The process of using wide ranging and interacting data to understand processes, problems, and solutions.
– The opposite of “one factor at a time” where the tendency is to change one factor and “see” what happens.
– Statistical thinking is the tendency to want to understand situational phenomena over a wide range of data where several control factors may be interacting at once to produce an outcome.
– Common cause variation becomes your friend and special cause variation your enemy.
– Attribute judgements of good and bad are replaced with estimates of significance with given confidence.
Statistical Process Control Model: Percent Variation from "Games Won" Target
Example 1: Notional Data – Status at Game 37
[Figure: I-MR chart versus Games Played. Individual Value chart: X-bar = 0, UCL = 24.2, LCL = -24.2. Moving Range chart: MR-bar = 9.1, UCL = 29.7, LCL = 0.]
Range outside UCL indicates “out of control”. Need to investigate “special cause”.
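The control limits on an individuals and moving range (I-MR) chart follow from the average moving range. A minimal sketch, assuming the standard constants for moving ranges of span 2 (2.66 for the individuals chart, 3.267 for the moving range chart); the function names are illustrative and the slides do not show this calculation.

```python
def limits_from_mr_bar(center, mr_bar):
    """I-MR control limits from a centre line and average moving range."""
    return {
        "x_ucl": center + 2.66 * mr_bar,   # 2.66 = 3 / d2, with d2 = 1.128
        "x_lcl": center - 2.66 * mr_bar,
        "mr_ucl": 3.267 * mr_bar,          # D4 for moving ranges of span 2
        "mr_lcl": 0.0,
    }

def imr_limits(values):
    """Compute I-MR limits directly from a series of individual values."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return limits_from_mr_bar(x_bar, mr_bar)

# Centre line and average moving range as reported on the Example 1 chart.
chart = limits_from_mr_bar(0.0, 9.1)
print({name: round(value, 1) for name, value in chart.items()})
```

Points beyond these limits are the ones flagged as potential special causes; `imr_limits` applies the same rule to any series of percent-off-target values.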
Statistical Process Control Model: Percent Variation from "Games Won" Target
[Figure: I-MR chart versus Games Played (games 1 to 46). Individual Value chart: X-bar = 0, UCL = 12.44, LCL = -12.44. Moving Range chart: MR-bar = 4.68, UCL = 15.28, LCL = 0.]
[Figure: Scatterplot of Win target, Actual Wins, Percent off Target vs Games Played]
Which Method is Earliest at Detecting a “Special Cause”?
Old Way
Analytics Approach
Next Steps
• Additional MLB Analytics
• System approach to baseball
• Other sports?
– Golf Fishbone Cause and Effect Analysis example
Baseball statistics are like a girl in a bikini. They show a lot, but not everything. ~Toby Harrah, 1983
Wasiloff – Young Baseball Analytics: “Systems Approach to Batting”
Analytic Based Reactive Batting Problem Solving
Pre Emptive Batting Problem Discovery
Optimal Batting System Design
[Figure: fishbone diagram with branches Batter, Accessories, Fundamentals, Bat, Stadium]
Our Mission
Develop world class batters who use consistent, disciplined, and proven methods of eliminating or preventing hitting problems, thereby providing our fans excellence in batting and league leading run creation, resulting in high level fan satisfaction.
Systems Based Potential Causes
• Lean Six Sigma Analytics
• Design for Six Sigma
• Statistical Methods
• Correlation/Regression Analysis
• Design of Experiments
• VOC / QFD
• Taguchi Methods
• Innovation Methods
Questions / comments?
Thanks!
Baseball? It's just a game - as simple as a ball and a bat. Yet, as complex as the American spirit it symbolizes. It's a sport, business - and sometimes even religion. ~Ernie Harwell, "The Game for All America," 1955