© Andrew Ho, Harvard Graduate School of Education
This is S-030: Applied Regression and Data Analysis
For those of you looking for S-052 with John Willett, we’ve switched rooms.
Go upstairs to Larsen 106.
Today’s class starts at 10:15. All subsequent classes will start promptly at 10:10.
Unit 1 / Page 1
Room Switch!
Welcome to S-030!
http://xkcd.com/715/
Building Perception:
1. A well-designed figure should be the goal of every analysis.
2. Know your axes.
3. Know your axis ranges.
4. Anthropomorphize yourself as a point on a figure and ask: who am I, and what story do I tell?
5. Outliers are observations to be explained, not blemishes to be erased or fixed.
How you’ll spend your time in S-030, Part I: What we’ll do in class
Each unit addresses the following:
1) Research Questions and Data Sets
2) Statistical Concepts and Content
3) Interpretation and Presentation of Results
Lectures with your questions: active participation is encouraged, time permitting.
Note-taking: on laptops (in laptop zones at the edges of the lecture hall) or printouts of handouts.
Please be courteous: no cellphones, email, websurfing, IM, texting, or other electronic distractions during class. Attend class. On time.
How you’ll spend time in S-030, Part II: What you’ll do outside of class

Assignments
• 6 homework assignments (~⅔ of grade), pairs mandatory except for the 3rd assignment (done individually).
• 1 final project (~⅓ of grade), completed individually.
• Due e-submitted on iSites by 1PM on the date noted on the syllabus.
• No late assignments accepted.

Individual and group work
• Work in study groups as you’d like, but write and submit HWs in pairs (or individually for the 3rd assignment and final).
• Find partners and study groups on your own, in sections, by emailing us, or by using this Google Doc (linked on the iSite under the Assignments tab): https://docs.google.com/spreadsheet/ccc?key=0AuXHUPzC9fGPdGdSc0VacnFzUnI5YmswbkoxNGE3YWc
• Reference any collaborations beyond yourself and your partner with a footnote on the first page of your completed assignment.

Where to get help:
- Lectures and the Course Website
- Partners and Study Groups
- Your “Homeroom” TF and weekly section
- Any other TF and other sections
- Professor office hours: http://andrew-ho-office-hours.wikispaces.com/

Course website: http://my.gse.harvard.edu/course/gse-s030/2012/spring
No required reading, but optionally: …and review course slides.
The First Three Weeks
January 2012
• Sun 22 – Week 1 begins; no sections
• Tue 24 – Class 1
• Wed 25 – “Doodle” poll for sections sent to you
• Thu 26 – Class 2
• Fri 27 – Doodle poll for sections due, 1PM
• Sun 29 – Week 2 begins; sections begin
• Mon 30 – Assignment 1 out
• Tue 31 – Class 3; sections

February 2012
• Wed 1 – Week 2 (cont.); sections
• Thu 2 – Class 4; sections
• Sun 5 – Week 3 begins
• Tue 7 – Class 5; sections
• Wed 8 – Sections
• Thu 9 – Class 6; sections
• Fri 10 – Drop/grade-change deadline; Assignment 1 due, 1PM
Unit 1: Introduction to Simple Linear Regression (Classes 2–4)
http://xkcd.com/1007/
Assignment #1 has been posted: assignment, data, template, grading guidelines, and partnering spreadsheet.
Sections begin this week. Priya’s section meets today at 2:40 in Gutman 440 (this week only); all other sections meet in Gutman 302.
Where is Unit 1 in our 11-Unit Sequence?
Building a solid foundation
Unit 1: Introduction to simple linear regression
Unit 2: Correlation and causality
Unit 3: Inference for the regression model

Mastering the subtleties
Unit 4: Regression assumptions: evaluating their tenability
Unit 5: Transformations to achieve linearity

Adding additional predictors
Unit 6: The basics of multiple regression
Unit 7: Statistical control in depth: correlation and collinearity

Generalizing to other types of predictors and effects
Unit 8: Categorical predictors I: dichotomies
Unit 9: Categorical predictors II: polychotomies
Unit 10: Interaction and quadratic effects

Pulling it all together
Unit 11: Regression in practice. Common extensions.
How Can Regression Help Us? Prediction.
• Prediction
– Given lifestyle and background, I can predict your lifespan.
– Given polling results to date, I can predict who will win the Florida primary.
– Given previous award wins and nominations, I can predict who will win an Oscar.
http://www.livingto100.com/
http://elections.nytimes.com/2012/fivethirtyeight/primaries/florida
http://carpetbagger.blogs.nytimes.com/2011/02/24/4-rules-to-win-your-oscar-pool/
How Can Regression Help Us? Association and Theory Building.
• Establishing association and relative strengths of associations between variables.
• Building theory: Which constructs are similar and which dissimilar?
http://www.education.com/reference/article/triarchic-theory-of-intelligence/
http://www.socialresearchmethods.net/kb/mtmmmat.php
http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&id=2011-19822-001
How Can Regression Help Us? Understanding Cause and Effect.
• Understanding Cause and Effect (Be careful, here.)
– If we link teacher pay to student performance, do students learn more?
– If we increase our movie’s advertising budget, how much will our box office revenues grow?
– If my parents had gone to college, what might my GPA have been?
• The causal inference must be supported by a *design*, not merely a model (more on this in Unit 2).
• Consider causal inference a second layer of inference, above and beyond prediction and association, that requires a higher standard of evidence and argument.
Three Types of Variables
Question Predictor(s)
A variable or variables whose effects or predictive potential you wish to examine.
Graphically: on the horizontal axis.
Also: Independent Variable(s)

Outcome Variable
A variable that you wish to predict, have an effect upon, or use to measure the effects of the predictors.
Graphically: on the vertical axis.
Also: Response Variable, Dependent Variable

Covariate(s)/Control(s)
Additional predictors whose effects you would like to adjust for, account for, or “statistically control for.”
Graphically: on the legend.
Also: Control Predictors, Extraneous Variables
The distinction between Predictors and Covariates is substantive, not statistical. Sometimes, no distinction is made between them.
The distinction between the predictor (X) and the outcome (Y) is substantive AND statistical.
Example 1: Critical Acclaim and Oscar Odds
• How well does the overall critical acclaim of a movie predict whether it will win an Oscar?
– Question Predictor: Critical Acclaim (Rotten Tomatoes score: what percent of critics like/recommend the movie?)
– Outcome Variable: Oscar Win
– Control Predictor: Golden Globe (alternative award) Win
– In boxes and arrows: Critical_Acclaim → Oscar_Win, with Golden_Globe as a control.
– In an informal prediction equation: Oscar_Win = Critical_Acclaim + Golden_Globe
– In a graph: the outcome on the vertical axis, the question predictor usually on the horizontal axis.
[Figure: Oscar win probability (vertical axis, 0–100%) vs. Critical_Acclaim (Tomatometer, horizontal axis, 0–100%), with separate curves for Golden Globe Win and Golden Globe Loss.]
Example 2: The SATs and Freshman GPA
• In 2001, an influential study on University of California students prompted the redesign of the SATs.
• The variables: Freshman GPA (FGPA), High School GPA (HSGPA), SAT I, and SAT II (Achievement/Subject Tests).
[Path diagram: HSGPA, SAT I, and SAT II as predictors of FGPA.]
Model Language: How well does the SAT I predict freshman GPA when accounting for (after controlling for) high school GPA and SAT II scores?
http://www.ucop.edu/sas/research/researchandplanning/pdf/sat_study.pdf
Statistical models will almost never fit the data perfectly. Interpreting their results requires considering:
• Other structural (systematic) components (not included in the model or not measured)
• Sampling Variation – Differences between the sample data that you have and the population to which you hope to generalize.
• Neglecting these is to neglect systematic and random error, respectively.
Models: Simplified representations of relationships among variables

Mathematical/Physical Models
Geometry: Area of a square = (length)²
Or kinetic energy: KE = (1/2)·mass·velocity²
Mathematical models and most physical models are deterministic:
• Some are linear; some nonlinear, but…
• All squares behave this way: once we know the “rule,” we can use it to fit the model to data perfectly.
• We might say that “the data fit the model” rather than “the model fits the data.” The former is not a widely accepted expression in statistics.

Statistical Models
Modeling people, organizations, or any type of social or cultural unit.
Outcome = Structural Component (Predictors) + Residual (Error)
Step 1: Identify the structural (systematic) components and note their predictive utility for the outcome.
Step 2: Assess how well we did by examining the residual (error) terms.
Step 3: Evaluate whether an alternative structural model may better explain the outcome.
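The three steps above can be sketched in code. This is an illustrative Python sketch with made-up numbers, not the course’s Stata workflow; the data and variable names are invented for the example:

```python
# Outcome = Structural Component (Predictors) + Residual (Error)
# Hypothetical data, invented for illustration only.
xs = [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]  # predictor
ys = [68.0, 74.0, 85.0, 95.0, 99.0, 112.0, 118.0]   # outcome

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Step 1: identify a structural component -- here a straight line, fit by least squares.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Step 2: assess how well we did by examining the residuals (observed - predicted).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse_line = sum(e ** 2 for e in residuals)

# Step 3: compare against an alternative structural model -- here, "mean only,
# no predictor" -- to ask whether the line explains the outcome better.
sse_mean = sum((y - y_bar) ** 2 for y in ys)
print(b0, b1, sse_line, sse_mean)
```

The comparison in Step 3 foreshadows the ANOVA decomposition later in this unit: the mean-only model’s error is the total variation that the line tries to explain.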
The Art of Statistical Modeling
• Statistical mantra (attributed to George Box): All models are wrong; some are useful.
• Our goal is less a single statistical model than a clearly written account of the journey we take to get there.
• We take the perspective that all statistical models are vast oversimplifications that may nonetheless lead us to make accurate predictions, increase our understanding of relationships and mechanisms, and build useful theory.
• Our most common and dangerous errors (and the ones we grade most harshly) typically involve overstating our results, sliding into deterministic, unwarranted causal language, and drifting lazily towards a single, simple answer.
Outcome = Structural Component (Predictors) + Residual (Error)
http://xkcd.com/793/
The sordid history of Statistics: Galton and Burt
Sir Francis Galton (1822-1911)
• A strong relationship in which nature dominates: “families of reputation were much more likely than ordinary families to produce offspring of ability”
• Recommended “judicious marriages during several generations” to “produce a highly gifted race of men”
• His “genetic utopia”: “Bright, healthy individuals were treated and paid well, and encouraged to have plenty of children. Social undesirables were treated with reasonable kindness so long as they worked hard and stayed celibate.”
No data for “intelligence,” so he instead studied HEIGHT.
Outsourced modeling to JD Dickson, a Cambridge mathematician, who formalized linear regression.
• Sir Cyril Burt (1883-1971)
• Burt’s father was Galton’s physician
• Over a 30-year period, he and two RAs (Miss Howard and Miss Conway) accrued data on 53 pairs of separated twins:
• 15 pairs in 1943
• Up to 21 pairs in 1955
• Up to 53 pairs in 1966
• “‘Intelligence’, when adequately assessed, is largely dependent on genetic constitution” (Burt, 1966)
Studied heredity by fitting statistical models predicting IQs of identical twins raised in “foster” (adoptive) homes from IQs of siblings raised in biological parents’ homes
IQ scores for Cyril Burt’s identical twins reared apart
Results of the list command in Stata
Predictor (X): owniq
Outcome (Y): fostiq
RQ: What’s the relationship between the IQ of the child raised in an adoptive home and his/her identical twin raised in the birth home? n = 53
Instinct 1: Get your eyes on (and fingers in) the data. [Also, try the Data Editor (browse) button.]
Try also: summarize, detail. Or: histogram owniq, frequency kdensity. Or: dotplot owniq fostiq.
Univariate summaries of the predictor and the outcome
Results of the summarize, stem, and graph box commands
Instinct 2: Visualize the data. Univariate summaries, stat.
• What is the unit of analysis? What are we looking at? Persons? Schools? Countries?
• What is the scale? What does the distance between the minimum and the maximum mean?
• What is the central tendency (mean/median)? Is that a reasonable value on this scale?
• What is the standard deviation or interquartile range? Is that a reasonable value on this scale?
• What is the skewness/symmetry? Is the mean much larger than the median (suggests positive skew)?
• Are there many outlying observations (high kurtosis/heavy tails) or surprisingly few (whiskers much shorter than 1.5·IQR)?
• Remember that there are no hard and fast definitions for positive skew, heavy tails, or outliers. The relevance of these observations must be mediated by your substantive knowledge of the domain and the final inferences you wish to draw from your analysis.
• The smaller the sample, the more anomalies we expect. Don’t obsess over slight skewness and kurtosis when samples are small.
• Looking at these distributions, they are similar and symmetrical with no obvious outlying univariate observations.
Questions to ask about univariate data graphs and summaries.
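Several items on this checklist are directly computable. Here is a Python sketch with invented data (the course reads these quantities off Stata’s summarize, stem, and graph box output instead):

```python
# Invented data (in $ millions) with one large value, for illustration only.
data = [40.6, 12.3, 65.8, 30.0, 22.5, 95.1, 381.0, 55.2, 18.9, 48.7]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def sample_sd(xs):
    # Sample standard deviation, dividing by n - 1.
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

m, med, s = mean(data), median(data), sample_sd(data)

# Mean much larger than median suggests positive skew.
print("mean > median (positive skew)?", m > med)

# Flag observations far from the mean -- to be explained, not erased.
far_out = [x for x in data if abs(x - m) > 2 * s]
print("observations beyond 2 SDs:", far_out)
```

The 2-SD threshold here is an arbitrary illustration; as the checklist says, there is no hard and fast definition of an outlier.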
We fit models to data; we don’t fit data to models.
Growing accusations
• In 1973, Arthur Jensen, a supporter of Burt, noted “misprints and inconsistencies in some of the data”
• In 1974, Leon Kamin noted how odd it was that Burt’s correlation coefficients remained virtually unchanged as the sample size increased (r = .770, r = .771, and r = .771)
• In 1976, a London Sunday Times reporter tried to find the RAs and concluded that they did not exist
• In 1979, The British Journal of Psychology added the following notice to Burt’s 1966 paper: “The attention of readers of the Journal is drawn to the fact that it has now been established that this paper contains spurious data”
• In 1995, an edited volume with 5 essays, Cyril Burt: Fraud or Framed (Oxford), found evidence of sloppy writing and cutting and pasting of text, but perhaps not fraudulent data
• Debate continues to this day, and with Burt long dead, the conclusion may be that we’ll never know
• See also Stephen Jay Gould’s classic treatise against the folly of these early hereditarians: The Mismeasure of Man (1981).
Less “Ideal” Data: The 2011 Box Office
• Instinct 1: Get your eyes on the data
• movie11.dta – All 146 movies that achieved wide release in 2011.
grossmil – 2011 box office gross by yesterday (in millions of dollars)
theaters – Number of theater locations upon wide release (>600)
openday – Day of the year in 2011 (1 is January 1, 365 is December 31)
openmil – Opening weekend box office gross (in millions of dollars)
Source: http://boxofficemojo.com/
[Figures: US box office gross (in millions) by 1/25/12; number of theater locations upon opening.]
Univariate Summaries and Model Language
[Figure 1: histogram, frequency vs. US box office gross (in millions) by 1/25/12. Figure 2: histogram, frequency vs. number of theater locations upon opening.]
Instinct 2: Visualize. Univariate Summaries
Figure 1 shows the distribution of US box office gross revenues for the 146 movies that achieved a wide release in 2011. The distribution exhibits a strong positive skew, with a mean of $65.8 million, while half of releases earned less than the median of $40.6 million. The two top-grossing movies, Harry Potter and Transformers, were well over four standard deviations above the mean.
Figure 2 shows… The distribution has a noticeable negative skew, with a mean of 2800 below the median of 2954. There is the suggestion of bimodality with a second local maximum below 1000. This is consistent with a second category of “limited release” movies restricted from the analysis by variable definition.
*As an aside, don’t sweat kurtosis.
The bivariate relationship between the predictor and the outcome
Results of the graph twoway scatter or, simply, scatter command
Learn the standard terminology: a scatterplot of Y vs. X, or Y on X. Here, we plot fostiq on owniq.
Instinct 3: Visualize. Bivariate Summaries
Figure 3 represents the bivariate relationship between twin IQs as a scatterplot. The association is strong, positive, and linear, with no noticeable outlying observations. Hm…
[Figure 4: scatterplot of US box office gross (in millions) by 1/25/12 vs. number of theater locations upon opening.]
Less “Ideal” Bivariate Summaries
Instinct 3: Visualize. Bivariate Summaries
Figure 4 shows… There appears to be a strong, nonlinear, positive association between the two variables.
A small number of observations appear to have gross revenues larger than what might be predicted by the number of theater locations, although this conclusion requires further analysis.
Foreshadowing: A log transformation of the grossmil variable is the clear next step.
. scatter grossmil theaters
Questions to ask when examining scatterplots
First, all the same questions we asked for univariate graphs... for each variable. Then:
• Direction of relationship?
• Linearity of relationship?
• Strength of relationship?
• Any unusual observations?
How do we statistically model the bivariate relationship between X and Y?
Use theory and bivariate graphics to decide on the model’s structure: its functional form.
Simple linear regression: Why are straight lines so popular?
• Transformations to achieve linearity: In Units 5 & 10, we’ll learn how to tweak our variables so that we can use straight-line machinery to fit curves to data.
• A limited range of X may yield linearity: Range restrictions are common in social research and can allow the linear model to fit the data well within an observed range. Appreciating this, we must be cautious of extending predictions beyond the range of observed data.
• Actual linearity: Many relationships – particularly those in the physical world – are in fact linear.
• Mathematical and conceptual simplicity: A straight line is among the simplest of mathematical relationships between variables; it makes our work very tractable. Even when data arise from more complex nonlinear processes, the linear model can get us surprisingly accurate predictions.
[Figure: scatterplots of fostiq (IQ of twin raised in ‘foster home’ by adoptive parents) vs. owniq (IQ of twin raised in ‘own home’ by birth parents), both axes 60–140.]
How do we statistically model the bivariate relationship between X and Y?
Mathematically represent the model’s structure.
Any two points define a line. We use slope-intercept form: Y = β₀ + β₁X.
Y and X are our outcome and predictor, on the vertical and horizontal axes, respectively. β is the Greek “b” – beta.
http://www.livingwaterbiblegames.com/greek-alphabet-handwriting.html
[Figure: the fostiq vs. owniq scatterplot redrawn with both axes extended to 0, so the intercept (the predicted value of Y at X = 0) is visible.]
β₀ – The intercept (also beta-zero, beta-nought, the constant term): the predicted/estimated value of Y when X is 0, even if a 0 value for X is unobserved, unrealistic, or meaningless.
β₁ – The regression coefficient (also beta-one, the slope): the increment in the predicted/estimated value of Y for any unit change in X.
There is a big oversimplification here (parameters vs. statistics) that we will revisit in a moment!
Eyeballing the fitted line: intercept 10ish? 11ish? Slope: rise over run, just less than 1ish?
How do we statistically model the bivariate relationship between X and Y?
Formalize the statistical model.
As we said earlier, a statistical model has two components: a systematic/structural component and a residual/error component. We formalize that here: Yᵢ = β₀ + β₁Xᵢ + εᵢ.
ε is the Greek “e” – epsilon.
http://www.livingwaterbiblegames.com/greek-alphabet-handwriting.html
ε – The residual (also, the error term, “e”): the difference between the observed (actual) value and the expected (predicted) value from the systematic/structural component. The “miss.” Reflects the reality that our model is not deterministic. We generally want these values to be as small as possible. They represent a challenge to our structural model: what is left to be explained.
The Population Regression Equation and Residuals
(Xᵢ, Yᵢ), a shorthand for datum i.
β₀ = 10 and β₁ = .9, let’s say…
The “miss,” εᵢ, visualizable as a vertical distance from the line.
Ŷᵢ = 10 + .9Xᵢ: “Y-hat,” our predicted/estimated value… The Prediction Equation. Assumes known parameters.
Population Parameters vs. Sample Statistics
The Population. An abstraction to which we generalize.All possible twins, separated at birth, that we could theoretically sample.Generally thought to be infinite.
The Sample. The data we have sampled, ideally in an unbiased, representative fashion, from the population. Of finite size, in this case, n = 53.
A parameter is a fact about a population.
Parameters β₀ and β₁. Written in Greek. Rarely if ever known in practice.
A statistic is a fact about a sample.
Statistics β̂₀ and β̂₁. Written in Roman, or in Greek with hats designating parameter estimates. What we use for inference about the population.
Inference about the population from sample data:How do we fit the hypothesized model to observed data?
How do we get β̂₀ and β̂₁?
Understanding the (ordinary) least squares (OLS) criterion
Observed values (Yᵢ) are the sample foster IQ points.
Residuals (eᵢ) are the vertical distances between the observed and predicted values for each datum. Every data point has a residual.
So a “good” line would go through the “center” of the data and have small residuals (eᵢ);
…perhaps as small as possible???
Predicted values are estimated using the prediction equation, Ŷᵢ = β̂₀ + β̂₁Xᵢ.
Ordinary Least Squares (OLS) criterion: Minimize the sum of the squared vertical residuals.
The least-squares criterion selects the statistics that make the sum of squared residuals as small as possible (for this particular sample). Obtaining the least-squares regression line for the sample data is a purely geometrical problem requiring no assumptions.
How do we find the “best-fit” line that has the smallest residuals possible?
http://illuminations.nctm.org/activitydetail.aspx?id=82
Σ(eᵢ)² = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − (β̂₀ + β̂₁Xᵢ))²
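The OLS criterion has a closed-form solution: β̂₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and β̂₀ = Ȳ − β̂₁X̄. As an aside, a Python sketch with invented numbers (not Burt’s 53 pairs) can check that the resulting line really does make the sum of squared vertical residuals as small as possible:

```python
# Invented (X, Y) pairs for illustration; not Burt's actual data.
xs = [68.0, 75.0, 82.0, 90.0, 97.0, 105.0, 114.0, 120.0]
ys = [63.0, 80.0, 78.0, 94.0, 96.0, 109.0, 125.0, 118.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Closed-form least-squares estimates.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

def ssr(a, b):
    """Sum of squared vertical residuals for the line Yhat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = ssr(b0, b1)
# Nudging the line in any direction increases the criterion.
for da, db in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert ssr(b0 + da, b1 + db) > best
print(round(b0, 3), round(b1, 3), round(best, 1))
```

Because the criterion is a convex function of the intercept and slope, any perturbation of the least-squares line strictly increases the sum of squared residuals, which is what the loop verifies.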
Three “best-fit” regression lines. The line you want, and why.
The OLS criterion minimizes the sum of vertical squared residuals.
Other definitions of “best fit” are possible:
• Vertical squared residuals (OLS)
• Horizontal squared residuals (X on Y)
• Orthogonal residuals
Why does it make substantive sense to minimize vertical residuals? Because we are interested in trying to predict Y given X.
Midpoint of the vertical slice.
Regression in the Population: The Conditional Distribution
[Figure: density plots of the unconditional distribution of fostiq (IQ of twin raised in ‘foster home’ by adoptive parents), and of the conditional distributions of fostiq given particular values of owniq.]
Regression in the sample: Imagining conditional distributions in the population
The lfit command and layering plots using ()
Y = β̂₀ + β̂₁X + e
Ŷ = β̂₀ + β̂₁X
From this perspective, we can see that the predicted value, Ŷ, represented by the regression line, is a conditional mean: an average of a conditional distribution at a particular vertical slice through the data.
We should start to perceive these population conditional distributions, their means, and their variances.
When X = 93, Ŷ ≈ 94; when X = 106, Ŷ ≈ 106.
The mean and standard deviation of conditional distributions
You should be able to eyeball the mean and standard deviation of any approximately normal distribution. The mean is the balance point, the central tendency. This is straightforward. The standard deviation is the distance from the mean to the inflection point of the curve. Approximately 2/3 of observations are within ±1 standard deviation of the mean.
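The “about 2/3” figure can be checked against the normal CDF, which Python’s standard library exposes via math.erf. A numerical aside, not part of the course materials:

```python
import math

def normal_cdf(z):
    # CDF of the standard normal, written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability that a normal observation falls within 1 SD of the mean.
within_1sd = normal_cdf(1.0) - normal_cdf(-1.0)
print(round(within_1sd, 4))  # ≈ 0.6827, i.e., roughly 2/3
```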
For prediction, do we want conditional distributions to have large or small standard deviations?
Population mean: μ, the Greek “m,” pronounced “moo.”
Unconditional population mean: μ_Y. Unconditional sample mean: Ȳ. Conditional population mean: μ_{Y|X}. Conditional estimated mean: Ŷ.
Population standard deviation (SD): σ, sigma, the Greek “s” (vs. capital Sigma: Σ).
Unconditional population SD: σ_Y. Unconditional sample SD: s_Y. Conditional population SD: σ_{Y|X}. Conditional estimated SD: σ̂_{Y|X}, the RMSE.
These four assumptions are often summarized in a shorthand: εᵢ ~ iid N(0, σ²). This reads: the population residuals, εᵢ, are independent and identically normally distributed with mean 0 (values centered on the prediction) and a common variance σ² (and thus a common SD).
The linear regression model and its four assumptions
Assumption 1: At each value of X, there is a conditional distribution of Y that is normal with mean μ_{Y|X} and SD σ_{Y|X}.
Assumption 2: The straight-line model is correct: the μ_{Y|X} fall on a line.
Assumption 3: Homoscedasticity. The conditional standard deviations, σ_{Y|X}, are equal across all X.
Assumption 4: Conditional independence. For any value of X, the ε’s are independent. They share no hidden common association. Cannot be visualized from the plot.
Results of fitting a least squares regression line to Cyril Burt’s data
The regress command in Stata
Intercept parameter estimate: β̂₀ = 9.7195
Interpretation:
Intercept – For a pair of twins where owniq = 0, the predicted value of fostiq is 9.7195.
Slope – A unit increment in owniq is associated with a 0.9079-unit increment in the predicted value
of fostiq.
Mistake: A unit increment in fostiq is associated with…
Double check the number of observations.
Predictor Variable
Slope parameter estimate: β̂₁ = 0.9079
Always takes the form regress Y X
The Estimated Prediction Equation: Ŷᵢ = 9.7195 + 0.9079·Xᵢ
Prediction Postestimation
(80, 82.35)
(120, 118.67)
(97.36, 98.11)
The Estimated Prediction Equation: Ŷᵢ = 9.7195 + 0.9079·Xᵢ
. predict yhat, xb
After the regress command, you can save the predicted values, Ŷᵢ, for each Xᵢ:
When X = 80, Ŷ = 82.35.
When X = 120, Ŷ = 118.67.
When X = X̄ = 97.36, Ŷ = Ȳ = 98.11.
The regression line will always go through the centroid: (X̄, Ȳ).
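The three predicted values follow directly from the estimated prediction equation; a quick Python check using the coefficients printed on the slide (9.7195 and 0.9079):

```python
# Coefficients as printed on the slide for Burt's data.
b0, b1 = 9.7195, 0.9079

def yhat(x):
    """Predicted fostiq for a given owniq, from the estimated prediction equation."""
    return b0 + b1 * x

print(yhat(80.0))    # ≈ 82.35
print(yhat(120.0))   # ≈ 118.67
print(yhat(97.36))   # ≈ 98.11: the line passes through the centroid (Xbar, Ybar)
```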
Disciplined perception: four values, three distances.
The story of Mary Kate and Ashley.
Four Values
Xᵢ, IQ of Mary Kate in own home.
Yᵢ, IQ of Ashley in foster home.
Ŷᵢ, predicted IQ of Ashley in foster home.
Ȳ, the unconditional mean of foster twins.
Three Distances
(Yᵢ − Ȳ), the Total Deviation, what to “explain”: Ashley above average.
(Yᵢ − Ŷᵢ), the Error/Residual Deviation, what is “unexplained”: Ashley above conditional expectation.
(Ŷᵢ − Ȳ), the Regression Deviation, what is “explained”: Ashley accounted for.
The Analysis of Variance (ANOVA) Decomposition
The ANOVA Decomposition of Deviation:
Total Deviation = Regression Deviation + Error Deviation
Point to the Mean = Line to the Mean + Point to the Line
This is a tautology, like saying,
It’s more interesting to ask, instead of decomposing the deviations for a single observation, how can we summarize the balance of regression vs. error deviations over all observations?
The ANOVA Decomposition of Variance:
Total Variation = Regression Variation + Error Variation
Step 1: Let’s make sure we understand how to compute residuals: vertical distances between observed values (Yᵢ) and fitted values (Ŷᵢ).
Conclusion: The IQ of the first foster twin is 8.5 points lower than the linear regression model predicts given his/her non-adoptive twin’s IQ.
Observation 1: X₁ = 68, Y₁ = 63
Ŷ₁ = 9.7195 + 0.9079(68) = 71.46
Over-predicted
Under-predicted
Positive residuals:
Negative residuals
Sometimes we under-predict, sometimes we over-predict, but across the full sample, the residuals will always sum to 0.
This is pretty nifty, but it also indicates why the sum of residuals is not a good summary of error.
First: For a particular Xᵢ, compute Ŷᵢ by substituting into the regression equation.
Second: Calculate the residual using eᵢ = Yᵢ − Ŷᵢ: e₁ = 63 − 71.46 = −8.46
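The two-step residual computation, plus the sum-to-zero property, in a short Python sketch. The first part uses the coefficients and observation printed on the slides; the second sample is invented to illustrate the property:

```python
# Slide's estimates for Burt's data, and Observation 1 (owniq = 68, fostiq = 63).
b0, b1 = 9.7195, 0.9079
x1, y1 = 68.0, 63.0
yhat1 = b0 + b1 * x1      # First: predicted value from the regression equation
e1 = y1 - yhat1           # Second: residual = observed - predicted
print(round(yhat1, 2), round(e1, 2))   # ≈ 71.46 and ≈ -8.46

# Residuals from any least-squares fit sum to (essentially) zero.
# Invented sample for illustration:
xs = [68.0, 80.0, 95.0, 110.0, 120.0]
ys = [63.0, 85.0, 92.0, 108.0, 121.0]
n = len(xs)
xb, yb = sum(xs) / n, sum(ys) / n
slope = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
         / sum((x - xb) ** 2 for x in xs))
intercept = yb - slope * xb
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(abs(sum(resid)) < 1e-9)  # True: which is why the plain sum is a poor error summary
```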
Ȳ = 98.11
Ŷᵢ = 113.22
Yᵢ = 125
Total Deviation: (Yᵢ − Ȳ) = 26.89
Step 2: To what might we reference the size of the residual?
Regression Deviation: (Ŷᵢ − Ȳ) = 15.11
Error Deviation: (Yᵢ − Ŷᵢ) = 11.78
(ID 46: owniq = 114, fostiq = 125)
Two ways the ANOVA decomposition helps us evaluate the quality of the fit
1.Total deviations provide a reference point for evaluating the magnitude of the residuals
2.Regression deviations quantify the accomplishment of the regression model.
Now:
1. Generalize these ideas across cases.
2. Arrive at a meaningful metric.
The mean would be our “best guess” for all values of Y if we had no information from our regression model. It is our reference point.
ANOVA decomposition of deviations
$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
Total Deviation = Regression Deviation + Error Deviation
(point to mean) = (line to mean) + (point to line)
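For ID 46, this identity is easy to verify numerically. A minimal Python check using the slide's rounded values:

```python
# ANOVA decomposition of deviations for ID 46 (values rounded as on the slide).
y_bar, y_hat, y = 98.11, 113.22, 125

total_dev = y - y_bar        # point to mean: 26.89
regr_dev = y_hat - y_bar     # line to mean:  15.11
error_dev = y - y_hat        # point to line: 11.78

# Total = Regression + Error, up to floating-point noise.
assert abs(total_dev - (regr_dev + error_dev)) < 1e-9
print(round(total_dev, 2), round(regr_dev, 2), round(error_dev, 2))
```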
Step 3: Summarize across observations to a single descriptive statistic
Total Dev: $(Y_i - \bar{Y})$ = Regr Dev: $(\hat{Y}_i - \bar{Y})$ + Error Dev: $(Y_i - \hat{Y}_i)$
Point to mean = Line to mean + Point to line
What is $R^2$?

$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

Sum of Squares Total (SST) = Sum of Squares Model (SSM) + Sum of Squares Error (SSE)

$R^2 = \dfrac{SSM}{SST} = \dfrac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$

The proportion of total variation that is accounted for by the model.
Analysis of Variance regression decomposition in Stata
$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

$12{,}035 \approx 9{,}251 + 2{,}785$

$R^2 = \dfrac{9{,}251}{12{,}035} \approx 76.86\%$
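A back-of-the-envelope check in Python, using the rounded sums of squares from the Stata ANOVA table (the small discrepancy in SS Total is rounding):

```python
# Rounded sums of squares from the lecture's ANOVA decomposition.
ss_model, ss_error = 9251, 2785
ss_total = ss_model + ss_error   # 12036; the table shows ~12035 (rounding)

r2 = ss_model / ss_total         # proportion of variation "accounted for"
print(round(100 * r2, 2))        # -> 76.86 (percent)
```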
Interpreting $R^2$

76.9 percent of the variation in the foster twins' IQ scores is "attributable to" or "accounted for by" or "explained by" or "associated with" or "predicted by" the IQ of the twin raised in the parental home.

What about the remaining 23.1%? Environment, SES of household, nutrition, measurement error, random error, individual variation, alien abductions… Error is what we haven't modeled yet.
The Ubiquitous $R^2$

The variance of $Y$ that is accounted for by $X$… The single most widespread and easily interpretable summary statistic derivable from a single regression analysis. Essential to describing the overall predictive function of the model.
SS Total = SS Model + SS Error
One last parameter to estimate: the residual variance, $\sigma^2_{Y|X}$
[Figure: at each value of $X$ ($X_1$, $X_2$, $X_3$, …), a conditional distribution of $Y$ centered at $\mu_{Y|X_1}$, $\mu_{Y|X_2}$, $\mu_{Y|X_3}$]
1. At each value of $X$, there is a distribution of $Y$. These distributions have a mean of $\mu_{Y|X}$ and a variance of $\sigma^2_{Y|X}$.
2. Homoscedasticity: the variances of each of these distributions, the $\sigma^2_{Y|X}$, are identical.
Of what importance is $\sigma^2_{Y|X}$, the residual variance of $Y$ at each value of $X$?
$\hat{\sigma}^2_Y = \dfrac{\sum (Y_i - \bar{Y})^2}{n-1}$
Why do we divide by $n-1$? Because we estimated 1 parameter (the mean) in order to estimate this other parameter (the variance). If we didn't do this, the sample variance would be a biased estimate of the population variance.
So $\hat{\sigma}^2_{Y|X}$ tells us about the variability of the residuals—the unexplained variability in $Y$ that's "left over"
From Variation to Variance
Let's start by reviewing the sample variance of $Y$
$\hat{\sigma}^2_Y = \dfrac{12{,}035}{52} \approx 231.45$

Does this numerator look familiar? It is SS Total, the sum of squared total deviations.

Now we move from the variance of $Y$ to the conditional variance of $Y$.
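The $n-1$ denominator is exactly what Python's `statistics.variance` uses. A quick check with hypothetical scores (not the actual twin data):

```python
from statistics import mean, variance

y = [63, 92, 125, 98, 112]      # hypothetical scores, not the Burt data
n = len(y)

# SS Total: squared "point to mean" deviations, summed.
ss_total = sum((yi - mean(y)) ** 2 for yi in y)
by_hand = ss_total / (n - 1)    # divide SS Total by n - 1

# statistics.variance (the *sample* variance) also divides by n - 1.
assert by_hand == variance(y)
print(by_hand)                  # -> 546.5
```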
From estimating $\sigma^2_Y$ to estimating $\sigma^2_{Y|X}$

$\hat{\sigma}^2_{Y|X} = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \dfrac{2784.66}{51} = 54.60$

Take away 2 because we estimated both $\beta_0$ and $\beta_1$ to estimate $\hat{Y}_i$.
Mean Square Error (MSE): $\hat{\sigma}^2_{Y|X} = 54.60$
Root Mean Square Error (RMSE): $\hat{\sigma}_{Y|X} = \sqrt{54.60} = 7.39$, the estimated standard deviation of the residuals

The Underappreciated RMSE
The estimated standard deviation of $Y$ conditional on $X$, $\hat{\sigma}_{Y|X}$. Expressed on the scale of $Y$, so it requires knowledge of that scale to interpret. Not distorted by restriction of range. Tends to remind you of just how much variance there is left to explain.
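Putting MSE and RMSE together in a short Python sketch (SS Error from the lecture; the $n-1 = 52$ denominator earlier implies 53 twin pairs; with one predictor we divide by $n-2$):

```python
from math import sqrt

ss_error = 2784.66    # sum of squared residuals from the lecture
n, k = 53, 1          # 53 twin pairs, one predictor (owniq)

mse = ss_error / (n - (k + 1))   # Mean Square Error: divide by n - 2 = 51
rmse = sqrt(mse)                 # back on the IQ-point scale of Y

print(round(mse, 2))             # -> 54.6
print(round(rmse, 2))            # -> 7.39
```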
$\widehat{\text{Variance}} = \dfrac{\text{Sum of Squares}}{n - \text{number of parameters estimated}}$
A loose understanding of degrees of freedom (df)
Why do we divide SS Total (Point to Mean) by $n-1$ to get our estimated variance, but divide SS Residual (Point to Line) by $n-2$ to get our estimated conditional variance?
The simplest answer: that's what we have to do to get unbiased estimates of the population parameters.
A somewhat less simple answer: we divide by the number of observations ($n$) minus the number of parameters we must estimate in order to calculate our target.
To estimate the variance, we first need to estimate the mean (Point to Mean), so we divide by $n-1$.
To estimate the conditional variance, we first need to estimate the regression line (Point to Line), and the line has two parameters, so we divide by $n-2$.
For the real answer, take a matrix algebra course!
In general: if you have a regression equation with $k$ predictors, you will need to estimate $k+1$ parameters. For example, in simple linear regression we have one predictor (e.g., owniq) and two parameters to estimate (slope and intercept). Thus, for OLS linear regression in general, our degrees of freedom for error will be $n-(k+1) = n-k-1$.
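A tiny sketch of the general rule (`df_error` is a hypothetical helper name, not course code):

```python
def df_error(n: int, k: int) -> int:
    """Error degrees of freedom for OLS with n observations and k predictors."""
    return n - (k + 1)   # k slopes plus 1 intercept are estimated

print(df_error(53, 1))   # simple regression on the 53 twin pairs -> 51
print(df_error(100, 3))  # three predictors -> 96
```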
Unit 1 Recap: Let’s step back a bit and see what we’ve learned.
• Burt wanted to understand how well the IQ of a separated twin in his or her own home predicted the IQ of a matched twin raised in a foster home.
• We obtained a prediction equation and two statistics that describe how well the prediction equation works: $R^2$ and the MSE or RMSE.
• I would describe this as a strong predictive relationship given the context. A substantial proportion of the variance of foster twins' IQs can be accounted for by the IQs of their respective twins raised in their own homes.
A 1 IQ-point increment in owniq is associated with a 0.91 IQ-point increment in predicted fostiq.
An $R^2$ of .7686: 76.86% of the variance in fostiq is accounted for by the predictor variable, owniq.
An RMSE of 7.39: conditional on a particular owniq value, actual fostiq values tend to vary around the prediction with a standard deviation of 7.39. The unconditional sd was 15.21.
$\widehat{fostiq} = 9.719 + 0.908 \cdot owniq$
Don’t forget your research question… and other research questions of interest
• Apart from prediction for and association between owniq and fostiq, is there another research question that might arise from these data?
• It is worth remembering that regression works perfectly well when $X$ and $Y$ are on different scales. We can use weight to predict height, socioeconomic status to predict test scores, and whether our class meets to predict snowfall.
• However, when X and Y are on the same scale, different research questions can arise…
Regression helps us understand whether there is an association between $X$ and $Y$, whether we can predict $Y$ from $X$, and whether $X$ can account for $Y$'s variance.
Mean comparisons address whether the means of $X$ and $Y$ are equal or different.
Different research questions; different statistical approaches
Caution: Don't abuse or overinterpret kdensities! Reserve them for large sample sizes or well-behaved distributions.
What are the take-home messages from this unit?
• The regression model represents your hypothesis about the population
– When you fit a regression model to data, you are computing sample estimates of population parameters that you'll never actually observe directly
– Don’t confuse sample estimates with population parameters—estimates are just estimates, even if they have sound statistical properties
– The regression model is a model for the average of $Y$ at each given value of $X$: the conditional mean of $Y$ given $X$. Individual variation figures prominently in the model through the error term (and the residuals that represent it)
• Be sure to fully understand the meaning of the regression coefficients
– These are the building blocks for all further data analysis; take the time to make sure you have a complete and intuitive understanding of what they tell us
– Distinguish clearly between the magnitude (the value of $\hat{\beta}_1$ in context) and the strength of a predictive relationship (high $R^2$, low RMSE)—don't confuse these separate concepts.
– The regression approach assumes linearity. We’ll learn in Units 4 and 5 how to evaluate these assumptions and what to do if they don’t hold
• $R^2$ is a nifty summary of how much the regression model helps us
– Be careful about causal language—the phrases “accounted for” or “explained by” do not imply determinism or causation.
– The ANOVA decomposition of variance, which leads to $R^2$ and our estimate of the RMSE (the residual standard deviation), will appear in subsequent calculations; be sure you understand them.