Preview of Chapters 10 and 11

Mon 1 October 2007

Chapter 10: Correlation and Regression

1. Correlation

2. Regression

3. Variation and Prediction Intervals

4. Rank Correlation

1. Correlation

• Relationship between two measured variables in a dataset, at the interval or ratio level

• In this book: linear relationships only

• Pay attention to the requirements!

• Measure: Pearson product-moment correlation, r (sample) or rho (population)

• No correlation: r = 0; maximum correlation: r = -1 or +1

• Critical values: Table A-6

Scatterplots of Paired Data

Figure 10-2

Formula 10-1

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$

The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample.

Calculators can compute r
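As a minimal sketch of what such a calculator does, Formula 10-1 can be written out directly in Python. The function name `pearson_r` and the five data pairs are hypothetical and chosen only for illustration; they do not come from the book.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r, computed term by term as in Formula 10-1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))
```

Whether this r is significant still has to be judged against the critical value in Table A-6 for the given n.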

Formula

Figure 10-3

Hypothesis Test for a Linear Correlation

2. Regression

• Continuation of correlation

• Computing the regression line in the scatterplot: the line that best fits the cloud of points

• Goal: predicting values

Regression

The typical equation of a straight line, y = mx + b, is expressed in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.

The regression equation expresses a relationship between x (called the independent variable, predictor variable, or explanatory variable) and $\hat{y}$ (called the dependent variable or response variable).

Formulas for b0 and b1

Formula 10-2

$$b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{(slope)}$$

Formula 10-3

$$b_0 = \bar{y} - b_1 \bar{x} \quad \text{(y-intercept)}$$

Calculators or computers can compute these values.
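The two formulas translate almost one-to-one into code. The sketch below assumes a hypothetical helper `regression_line` and uses made-up data (not Table 10-1), just to show how the slope and intercept follow from the sums.

```python
def regression_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1*x (Formulas 10-2 and 10-3)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    b0 = sum_y / n - b1 * (sum_x / n)                               # y-intercept
    return b0, b1

# Hypothetical paired data (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = regression_line(xs, ys)

# Predicted value for a given x
y_hat = b0 + b1 * 4
print(b0, b1, y_hat)
```

A prediction from this line is only meaningful when the correlation is significant and x stays within the range of the sample data.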

Example: Old Faithful (continued)

Given the sample data in Table 10-1, find the regression equation.

Procedure for Predicting

Figure 10-7

3. Variation and Prediction Intervals

• Continuation of the regression line

• (Chapter 7) Confidence interval = interval estimate of population parameters: proportion, mean, variance

• Here: interval estimate of the predicted value of a variable

Key Concept

In this section we proceed to consider a method for constructing a prediction interval, which is an interval estimate of a predicted value of y.

Prediction Interval for an Individual y

$$\hat{y} - E < y < \hat{y} + E$$

where

$$E = t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n(\sum x^2) - (\sum x)^2}}$$

$x_0$ represents the given value of x

$t_{\alpha/2}$ has n – 2 degrees of freedom
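A rough sketch of the prediction interval in Python follows. It makes several assumptions: the function name `prediction_interval` is hypothetical, the data are made up, the critical t value is passed in by hand (read from a t table for n – 2 degrees of freedom), and the standard error of estimate is computed as the square root of the sum of squared residuals divided by n – 2.

```python
import math

def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, upper) limits of the prediction interval for y at x = x0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Regression line y-hat = b0 + b1*x
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)

    # Standard error of estimate (assumed form: sqrt(SSE / (n - 2)))
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))

    # Margin of error E
    x_bar = sum_x / n
    margin = t_crit * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    y_hat = b0 + b1 * x0
    return y_hat - margin, y_hat + margin

# Hypothetical data; t_crit = 3.182 is the two-tailed 0.05 value for 3 degrees of freedom
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
low, high = prediction_interval(xs, ys, x0=4, t_crit=3.182)
print(low, high)
```

With so few data points the interval is wide, which is the expected behavior: the margin E grows with the standard error and with the distance of x0 from the mean of x.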

Standard Error of Estimate

Definition

The standard error of estimate, denoted by $s_e$, is a measure of the differences (or distances) between the observed sample y-values and the predicted values $\hat{y}$ that are obtained using the regression equation.
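For reference, the definition is usually written as the formula below. This is the standard textbook form and is not shown on the slide itself; the divisor n – 2 matches the degrees of freedom used for the t value in the prediction interval.

```latex
s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}
```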

4. Rank Correlation

• Nonparametric method = distribution-free test = no assumptions about the distribution in the population

• Test of association between two variables

• Spearman's: $r_s$ (sample) or, for the population, $\rho_s$

• Procedure in Figure 10-10 (p. 537)
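One way to sketch the rank correlation in Python is to convert both variables to ranks and then apply the ordinary linear correlation to those ranks; without ties this gives the same result as the shortcut 1 – 6Σd²/(n(n² – 1)). The helper names `average_ranks` and `spearman_rs` and the data are hypothetical, and the sketch does not reproduce the decision procedure of the figure.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data (illustration only)
rs = spearman_rs([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
print(rs)
```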

Example

Chapter 11: Multinomial Experiments and Contingency Tables

1. Goodness-of-fit: multinomial

2. Contingency tables

3. Analysis of variance (ANOVA)

Overview

We focus on analysis of categorical (qualitative or attribute) data that can be separated into different categories (often called cells).

Use the $\chi^2$ (chi-square) test statistic (Table A-4).

The goodness-of-fit test uses a one-way frequency table (single row or column).

The contingency table uses a two-way frequency table (two or more rows and columns).

1. Goodness-of-fit: multinomial

• Does the observed probability distribution of a nominal variable agree with an expected distribution?

• H0: p1 = x, p2 = y, p3 = z, p4 = etc.

• H1: At least one of the observed proportions differs from the expected probability.

Goodness-of-Fit Test in Multinomial Experiments

Critical Values

1. Found in Table A-4 using k – 1 degrees of freedom, where k = number of categories.

2. Goodness-of-fit hypothesis tests are always right-tailed.

Test Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
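The test statistic is straightforward to compute by hand or in code. The sketch below uses a hypothetical function `chi_square_gof` and made-up counts for four categories that are claimed to be equally likely; the critical value 7.815 is the chi-square table value for 3 degrees of freedom at a 0.05 significance level.

```python
def chi_square_gof(observed, expected_probs):
    """Return (chi2, df) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]  # E = n * p for each category
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                      # k - 1 degrees of freedom
    return chi2, df

# Hypothetical observed frequencies (illustration only)
observed = [18, 22, 30, 10]
chi2, df = chi_square_gof(observed, [0.25, 0.25, 0.25, 0.25])

critical_value = 7.815            # right-tailed, df = 3, alpha = 0.05
reject_h0 = chi2 > critical_value
print(chi2, df, reject_h0)
```

Because the test is right-tailed, only a large value of the statistic (observed counts far from the expected counts) leads to rejection of H0.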

Example: Last Digit Analysis

Test the claim that the digits in Table 11-2 do not occur with the same frequency.

Relationships Among the $\chi^2$ Test Statistic, P-Value, and Goodness-of-Fit

Figure 11-3

2. Contingency Tables

• In this section we consider contingency tables (or two-way frequency tables), which include frequency counts for categorical data arranged in a table with at least two rows and at least two columns.

• We present a method for testing the claim that the row and column variables are independent of each other.

• We will use the same method for a test of homogeneity, whereby we test the claim that different populations have the same proportion of some characteristics.

Case-Control Study of Motorcycle Drivers

                             Black   White   Yellow/Orange   Row Totals
Controls (not injured)         491     377              31          899
Cases (injured or killed)      213     112               8          333
Column Totals                  704     489              39         1232

$$E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}$$

For the upper left hand cell:

$$E = \frac{(899)(704)}{1232} = 513.714$$

                             Black     White     Yellow/Orange   Row Totals
Controls (not injured)         491       377                31          899
  Expected                 513.714   356.827            28.459
Cases (injured or killed)      213       112                 8          333
  Expected                 190.286   132.173            10.541
Column Totals                  704       489                39         1232

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(491 - 513.714)^2}{513.714} + \cdots + \frac{(8 - 10.541)^2}{10.541} = 8.775$$

H0: Row and column variables are independent.

H1: Row and column variables are dependent.

The test statistic is $\chi^2 = 8.775$.

$\alpha = 0.05$

The number of degrees of freedom is (r–1)(c–1) = (2–1)(3–1) = 2.

The critical value (from Table A-4) is $\chi^2_{.05,2} = 5.991$.

Because the test statistic (8.775) exceeds the critical value (5.991), we reject the null hypothesis. It appears there is an association between helmet color and motorcycle safety.
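The whole calculation for the motorcycle helmet data can be retraced with a few lines of Python. This is only a sketch: the function name `chi_square_independence` is hypothetical, and the critical value is entered by hand from the table rather than computed.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # E = (row total)(column total) / (grand total) for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Observed counts: columns are Black, White, Yellow/Orange
observed = [
    [491, 377, 31],  # Controls (not injured)
    [213, 112, 8],   # Cases (injured or killed)
]
chi2, df, expected = chi_square_independence(observed)

critical_value = 5.991            # from the chi-square table, df = 2, alpha = 0.05
reject_h0 = chi2 > critical_value
print(round(chi2, 3), df, reject_h0)
```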

Figure 11-4

3. Analysis of Variance (ANOVA)

• ANalysis Of VAriance

• H0: several population means are equal

• F distribution (Table A-7)

• Test based on the P-value
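As a hedged illustration of where the F value comes from, a one-way ANOVA test statistic can be sketched as the ratio of the variance between samples to the variance within samples. The function name `one_way_anova_f` and the three small samples are hypothetical; the P-value itself would still come from software or from the F table.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation between the sample means and within each sample
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical samples (illustration only)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F (and therefore a small P-value) suggests that at least one population mean differs from the others.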

FINALLY: Bayesian statistics

• Texts and 2 assignments (will be handed out)

• 1. Intuitive approach

• 2. Formal approach

Example problem

• Given: In Orange County (USA), 51% of the population is male; 9.5% of the men smoke cigars, compared with 1.7% of the women

• Question: What is the probability that a randomly selected cigar smoker is a man?

1. Intuitive approach

2. Formal approach
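Both approaches can be sketched in a few lines of Python. The sketch assumes that the remaining 49% of the population is female and, for the intuitive approach, an arbitrary imagined population of 100,000 people; all variable names are illustrative.

```python
p_man = 0.51
p_woman = 1 - p_man               # assumption: everyone who is not male is female
p_cigar_given_man = 0.095
p_cigar_given_woman = 0.017

# 1. Intuitive approach: count people in an imagined population
population = 100_000
cigar_men = population * p_man * p_cigar_given_man        # men who smoke cigars
cigar_women = population * p_woman * p_cigar_given_woman  # women who smoke cigars
p_intuitive = cigar_men / (cigar_men + cigar_women)

# 2. Formal approach: Bayes' theorem
p_formal = (p_man * p_cigar_given_man) / (
    p_man * p_cigar_given_man + p_woman * p_cigar_given_woman
)

print(round(p_intuitive, 3), round(p_formal, 3))
```

Both routes give the same probability, because the imagined population size cancels out of the ratio.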

End of preview

• Next week (week 6):

– Question hour

– No new material

– Preparation for the practice exam

• Week 7: Monday 15 October

– Friday group: discussion of exercises, replacing Friday 12 October (because Joris is absent)