Math 15: Introduction to Scientific Data Analysis

Lecture 5: Association Statistics & Regression Analysis

University of California, Merced

Course Lecture Schedule

Week   Date          Concepts                                         Project Due
1
2      January 28    Introduction to data analysis
3      February 4    Excel #1 – General Techniques
4      February 11   Excel #2 – Plotting Graphs/Charts                Quiz #1
5      February 18   Holiday
6      February 25   Excel #3 – Statistical Analysis                  Quiz #2
7      March 3       Excel #4 – Regression Analysis
8      March 10      Excel #5 – Interactive Programming               Quiz #3
9      March 17      Introduction to Computer Programming – Part I
       March 24      Spring Recess
10     March 31      Introduction to Computer Programming – Part II   Project #1
11     April 7       Programming – #1                                 Quiz #4
12     April 14      Programming – #2
13     April 21      Programming – #3                                 Quiz #5
14     April 28      Programming – #4
15     May 5         Programming – #5                                 Quiz #6
16     May 12        Movies / Evaluations                             Project #2
Final  May ???       Final Examination

Quiz Next Week!

Project #1 – Due March 31st, 2008

Projects can be performed individually or in groups of up to three, with the following rules:

Teams turn in one project report, and all team members get the same grade.

A team consists of at most 3 people. No copying between teams!

A team project report must include a title page on which the team describes each member’s contribution.

10% bonus for projects done individually.

Individual projects must not be copied from anyone else.

No late projects will be accepted!

Project #1 will be posted at UCMCROP by next Monday!

Review: Measures of dispersion or variability

Variance or Standard Deviation

The one on the left is more dispersed than the one on the right. It has a higher variance or standard deviation.

[Figure: two distributions side by side, with the average and the mode marked.]

Which is the more precise measurement?

Although the standard deviation is a good measure of the precision of a given set of data, it can be difficult to compare the standard deviations of two different types of measurements directly.

You might need to do such a comparison to determine the largest source of uncertainty in an experimentally determined answer.

Units   Average   s (standard deviation)
mg      446       23
ml      35.49     4.5

Get the Right Tool for the Job!

Measures of dispersion or variability

One way to do this comparison is the relative standard deviation (RSD), which is simply the ratio of the standard deviation to the mean, expressed as a percentage:

RSD = 100 × (s / x̄)

Units   Average   s
mg      446       23
ml      35.49     4.5

RSD = 100 × (23 / 446) = 5.2%

RSD = 100 × (4.5 / 35.49) = 12.7%

Relative to its mean, the mg measurement is the more precise of the two, even though its standard deviation is larger in absolute terms.
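The RSD arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the version that works from raw data assumes the sample standard deviation (n - 1 in the denominator), which the slides do not specify.

```python
from statistics import mean, stdev


def rsd_from_summary(s, average):
    """Relative standard deviation in percent: 100 * s / mean."""
    return 100 * s / average


def rsd_from_data(values):
    """Same quantity computed from raw data.

    Assumption: uses the sample standard deviation (statistics.stdev).
    """
    return rsd_from_summary(stdev(values), mean(values))


# Summary values taken from the slide.
rsd_mg = rsd_from_summary(23, 446)
rsd_ml = rsd_from_summary(4.5, 35.49)

print(round(rsd_mg, 1), round(rsd_ml, 1))
```

The smaller RSD identifies the measurement that is more precise relative to its size.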

Any Questions?

Common Practice for Data Analysis

A common task in data analysis is to investigate an association between two variables.

To see if two variables vary together: Correlation

To see how one variable affects another: Regression

Correlation

A correlation tells us whether two variables vary together, i.e. as one goes up, the other goes up (or goes down).

Correlation Coefficient (Pearson product-moment correlation coefficient, or Pearson’s r)

Correlation Coefficient

r varies from +1 (perfect positive correlation) through 0 (no correlation) to -1 (perfect negative correlation).

[Figure: three example scatter plots, including sales vs. day (“sales A”), each labeled with its r value.]

Correlation Coefficient – cont.

Always draw a diagram to check that:

There are no OUTLIERS. If there are outliers, the following may not apply.

The relation is not curved (r only refers to LINEAR correlation).

r (approx.)     strength of tendency   what with what
0.9 to 1        strong                 high y with high x and low y with low x
0.7 to 0.9      some                   high y with high x and low y with low x
0.3 to 0.7      little                 high y with high x and low y with low x
-0.3 to 0.3     none                   neither high nor low y with high or low x
-0.3 to -0.7    little                 low y with high x and high y with low x
-0.7 to -0.9    some                   low y with high x and high y with low x
-0.9 to -1      strong                 low y with high x and high y with low x
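The table above can be read as a simple lookup. The sketch below is one way to encode it in Python; the function name is hypothetical, and because the bands in the table share their end points (0.9 appears in two rows), it assumes a boundary value belongs to the stronger band.

```python
def describe_correlation(r):
    """Return the rough strength label from the table for a given r.

    Assumption: a value exactly on a boundary (0.3, 0.7, 0.9)
    is assigned to the stronger band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size >= 0.9:
        strength = "strong"
    elif size >= 0.7:
        strength = "some"
    elif size >= 0.3:
        strength = "little"
    else:
        return "none"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.95))
print(describe_correlation(-0.5))
```

These labels are only rules of thumb, and they apply only after a plot has confirmed that there are no outliers and that the relation is not curved.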

Excel Function – Correlation Coefficient

= CORREL(array1,array2)

or

= PEARSON(array1,array2)

Example of a positive correlation: lengths of a leg bone (in cm) in penguin mating pairs

Ice cream sales vs. number of people who drown at sea

Month   # of ice cream sales (in millions)   # of drownings
1       0.30                                 0
2       0.20                                 0
3       0.90                                 1
4       1.50                                 1
5       2.00                                 3
6       3.50                                 5
7       5.50                                 6
8       8.00                                 8
9       7.50                                 5
10      2.50                                 1
11      0.80                                 0
12      0.70                                 0

Correlation Coefficient 0.927
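For readers who want to see what the correlation coefficient is doing, here is a minimal Python sketch of the standard Pearson formula applied to the ice cream data above. The function and variable names are illustrative and do not come from the lecture.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Monthly data from the slide (sales in millions, number of drownings).
ice_cream_sales = [0.30, 0.20, 0.90, 1.50, 2.00, 3.50,
                   5.50, 8.00, 7.50, 2.50, 0.80, 0.70]
drownings = [0, 0, 1, 1, 3, 5, 6, 8, 5, 1, 0, 0]

r = pearson_r(ice_cream_sales, drownings)
print(round(r, 3))  # compare with the coefficient quoted on the slide
```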

Wait!

What kinds of conclusions can we draw from a correlation?

Examples

Ice cream sales correlate with the number of people who drown at sea. Therefore, ice cream causes people to drown.

Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Hence, atmospheric CO2 causes crime.

Not Good Ones!

Correlation does not imply causation

No conclusion can be drawn about the existence or the direction of a cause-and-effect relationship from the mere fact that A is correlated with B. The correlation coefficient only tells you whether the two variables vary together.

Determining whether there is an actual cause-and-effect relationship requires further investigation, even when the relationship between A and B is statistically significant, a large effect size is observed, or a large part of the variance is explained.

Any Questions?

Regression

Regression is used when we have some reason to believe that changes in one variable cause changes in the other. A correlation coefficient by itself is not evidence of a causal relationship.

The simplest kind of causal relationship is a straight-line (or linear) relationship.

Linear regression

Linear regression

Linear regression assumes a linear relationship between two variables: a dependent variable, y, and an independent variable, x.

Mathematically, this relationship can be described by the following linear equation:

y = ax + b

where a is called the slope and b is called the intercept. This equation allows you to calculate y (dependent) from x (independent); the values of a and b are determined by the least-squares method.
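The slide does not spell out the least-squares method. As a sketch, the standard closed-form result for fitting a straight line to n data points (x_i, y_i) is given below; the symbols \bar{x} and \bar{y} (the means of the x and y values) are notation introduced here, not taken from the lecture.

```latex
a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

These are the values of a and b that minimize the sum of squared vertical distances between the data points and the line, \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2.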

Review - Math

Linear Equation: Slope and Intercept

y = ax + b

[Figure: graph of a straight line with y plotted against x; the slope a is the rise in y for a step of 1 in x, and the intercept b is the value of y at x = 0.]

Example: y = 3x + 8 has slope 3 and intercept 8.

Slope & Intercept formula

Lengths of a leg bone (in cm) in penguin mating pairs

Pair   Female   Male
1      17.1     16.5
2      18.5     17.4
3      19.7     17.3
4      16.2     16.8
5      21.3     19.5
6      19.6     18.3

            Formula                     Result
Slope       =SLOPE(C2:C7,B2:B7)         0.5205
Intercept   =INTERCEPT(C2:C7,B2:B7)     7.8830

In both functions, the first argument (C2:C7, Male) holds the Y-values and the second argument (B2:B7, Female) holds the X-values.
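As a cross-check outside Excel, the slope and intercept can be computed with the usual least-squares formulas. The Python sketch below uses the penguin data from this slide; the function name is hypothetical, and nothing here claims to show how Excel computes SLOPE and INTERCEPT internally.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Penguin leg-bone lengths from the slide.
female = [17.1, 18.5, 19.7, 16.2, 21.3, 19.6]  # X-values
male = [16.5, 17.4, 17.3, 16.8, 19.5, 18.3]    # Y-values

slope, intercept = least_squares_line(female, male)
print(round(slope, 4), round(intercept, 4))
```

As in the Excel functions, it matters which variable is treated as Y: swapping the two lists gives a different line.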

y = ax + b

a – slope & b – intercept

Pair   Female   Male   Predicted Male
1      17.1     16.5   16.78
2      18.5     17.4   17.51
3      19.7     17.3   18.14
4      16.2     16.8   16.31
5      21.3     19.5   18.97
6      19.6     18.3   18.08

Slope       0.5205
Intercept   7.8830

Each predicted Y-value is computed from the X-value in column B:

=$C$10*B3+$C$11

Don’t forget the $ signs! They keep the references to the slope (C10) and the intercept (C11) fixed when the formula is copied down the column.
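The predicted-value column is just y = ax + b applied to each X-value. A small Python sketch, assuming the rounded slope and intercept shown on the slide (the function name is illustrative):

```python
def predict(x, slope, intercept):
    """Predicted y for a given x on the line y = slope * x + intercept."""
    return slope * x + intercept


# Rounded coefficients as displayed on the slide.
slope = 0.5205
intercept = 7.8830

female = [17.1, 18.5, 19.7, 16.2, 21.3, 19.6]  # X-values
predicted_male = [predict(x, slope, intercept) for x in female]

for x, y_hat in zip(female, predicted_male):
    print(f"{x:5.1f} -> {y_hat:.2f}")
```

Because the coefficients here are rounded to four decimal places, a printed value can differ from the slide in the last digit; Excel works from the unrounded cell values.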

Plot a linear regression (or trend) line – Part 1

[Figure: two scatter plots of male size (mm) against female size (mm), drawn from the Pair / Female / Male / Predicted Male table.]

You can add a linear regression line.

Plot a linear regression (or trend) line – Part 2

Right-click on any data point on the graph.

Choose Add Trendline.

Click on the Options tab, and select Display equation and Display R-squared.

Click “OK”.

Don’t forget to check these two options (Display equation and Display R-squared)!

[Figure: scatter plot of male size (mm) against female size (mm) for the six penguin pairs.]

Plot a linear regression (or trend) line – Part 2 – cont.

R² value (R-squared value, RSQ): a “measure of scatter”

The closer this value comes to 1, the more closely the data points follow the fitted line, and the more accurate the prediction.

y = 0.5205x + 7.883

R² = 0.7767

[Figure: scatter plot of male size (mm) against female size (mm) with the fitted trend line.]
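To make the R-squared value less of a black box, here is a hedged Python sketch that computes it as 1 minus the ratio of the residual sum of squares to the total sum of squares. It uses the penguin data and the rounded trend-line coefficients from the slides, and the function name is made up for illustration.

```python
def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination for the line y = slope * x + intercept."""
    mean_y = sum(ys) / len(ys)
    # Scatter of the data around the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Scatter of the data around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


female = [17.1, 18.5, 19.7, 16.2, 21.3, 19.6]  # X-values
male = [16.5, 17.4, 17.3, 16.8, 19.5, 18.3]    # Y-values

r2 = r_squared(female, male, slope=0.5205, intercept=7.883)
print(round(r2, 3))
```

For a straight-line fit with one X variable, R-squared equals the square of the correlation coefficient, which is why 0.881 squared comes out at about 0.776.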

Let’s review the process!

Lengths of a leg bone (in cm) in penguin mating pairs

Pair   Female   Male
1      17.1     16.5
2      18.5     17.4
3      19.7     17.3
4      16.2     16.8
5      21.3     19.5
6      19.6     18.3

If there is some reason to believe that there is a causal relationship between the two variables, then plot a graph!

[Figure: scatter plot of male size (mm) against female size (mm), shown with its trend line.]

Correlation: to see if two variables vary together

Correlation Coefficient = 0.881

Regression: to see how one variable affects another

y = 0.5205x + 7.883, R² = 0.7767

Any Questions?