Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: 1) Construct graphs that appropriately describe data 2) Calculate and interpret numerical summaries of a data set. 3) Combine numerical methods with graphical methods to analyze a data set. 4) Apply graphical methods of summarizing data to choose appropriate numerical summaries. 5) Apply software and/or calculators to automate graphical and numerical summary procedures.


Section 3.1 Displaying Categorical Data

“Sometimes you can see a lot just by looking.”
Yogi Berra, Hall of Fame Catcher, NY Yankees

The three rules of data analysis won't be difficult to remember:

1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.

2. Make a picture: to show important features of and patterns in the data. You may also see things that you did not expect: the extraordinary (possibly wrong) data values or unexpected patterns.

3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Bar charts show counts or relative frequency for each category.
Example: Titanic passenger/crew distribution

[Figure: bar chart "Titanic Passengers by Class". Crew 885, First 325, Second 285, Third 706]

Pie charts show proportions of the whole in each category.
Example: Titanic passenger/crew distribution

[Figure: pie chart "Titanic Passengers by Class". Crew 40%, First 15%, Second 13%, Third 32%]
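The pie-chart percentages are just the bar-chart counts divided by the total. A minimal Python sketch of that conversion, using the four counts from the slide (the variable names are illustrative, not from the course materials):

```python
# Counts read from the Titanic bar chart on the slide.
counts = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

total = sum(counts.values())  # everyone on board

# Relative frequency (proportion of the whole) for each category.
rel_freq = {group: n / total for group, n in counts.items()}

for group, share in rel_freq.items():
    print(f"{group:<7}{counts[group]:>5}{share:>8.1%}")
```

Rounded to whole percents, these shares match the slice labels on the pie chart.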

Example: Top 10 causes of death in the United States

Rank  Cause of death         Count     % of top 10s  % of total deaths
1     Heart disease          700,142   37            28
2     Cancer                 553,768   29            22
3     Cerebrovascular        163,538    9             6
4     Chronic respiratory    123,013    6             5
5     Accidents              101,537    5             4
6     Diabetes mellitus       71,372    4             3
7     Flu and pneumonia       62,034    3             2
8     Alzheimer's disease     53,852    3             2
9     Kidney disorders        39,480    2             2
10    Septicemia              32,238    2             1
      All other causes       629,967                 25

For each individual who died in the United States, we record the cause of death. The table above is a summary of that information.

[Figure: bar graph "Top 10 causes of deaths in the United States"; vertical axis: counts (x1000), 0 to 800]

Top 10 causes of death: bar graph
Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Figure: the same bar graph sorted by rank. Easy to analyze.]

[Figure: the same bar graph sorted alphabetically. Much less useful.]

1 United States $1582 China $6443 Japan $544 Germany $2445 Britain $2356 France $1937 Brazil $1428 Italy $1319 Australia $12810 India $119

1 United States $13792 Japan $2343 Germany $204 Britain $1685 France $1266 Canada $737 Italy $638 China $54 9 Netherlands $5410 Australia $48

Recent Annual Software Sales ($billions)Recent Annual Computer Hardware Sales ($billion)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]

Make sure your labels match the data.
Make sure all percents add up to 100.

Internships
[Figure: basic bar chart and side-by-side bar chart]

Trend: Student Debt by State (grads of public 4-yr or more)
[Figure: side-by-side bar chart of student debt by state, 2009-10 vs 2012-13; vertical axis 0 to 40,000. National average: 2009-10 $21,604; 2012-13 $25,043]

Student Debt: North Carolina Schools
[Figure: bar chart "North Carolina Private Schools", by school, with the North Carolina 4-year-or-above mean marked; bars: tuition and fees (in-state), average debt of graduates; horizontal axis 0 to 60,000]
[Figure: bar chart "North Carolina Public Schools", by school, with the North Carolina 4-year-or-above mean marked; bars: tuition and fees (in-state), average debt of graduates; horizontal axis 0 to 25,000]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued) Displaying Quantitative Data

Histograms
Stem and Leaf Displays

Frequency Histograms
[Figure: frequency histogram "Baker City Hospital - Length of Stay Distribution"; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: frequency, 0 to 70]

[Figure: "Relative Frequency Histogram of Exam Grades"; horizontal axis: grade, 40 to 100; vertical axis: relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers
[Figure: two histograms on the classes 0<2 through 16<18 (frequency 0 to 70), centered at different values]

Histograms - Same Center, Different Spread
[Figure: two histograms on the classes 0<2 through 16<18 (frequency 0 to 70), with the same center but different spreads]

Histograms: Shape

Symmetric distribution
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution
Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont.): Female heart attack patients in New York state
[Figure: two histograms. Age: left-skewed. Cost: right-skewed.]

Shape (cont.): outliers. All 200 m Races, 20.2 secs or less
[Figure: histogram "200 m Races 20.2 secs or less (approx. 700)"; horizontal axis: times, 19.2 to 20.16; vertical axis: frequency, 0 to 60. Labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled]
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries
[Figure: Excel histogram; horizontal axis: bin; vertical axis: frequency, 0 to 1000]

Statcrunch Example: 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:
75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total           30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067

[Figure: "Relative Frequency Histogram of Grades"; horizontal axis: grade, 40 to 100; vertical axis: relative frequency, 0 to .30]
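The frequency and relative frequency columns for the exam grades can be reproduced with a short tally. This is only a sketch in Python; the class boundaries follow the slide ("40 up to 50" is read as 40 <= grade < 50, which is an assumption about how boundary values are assigned).

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

# Lower class limits; each class runs from `low` up to (not including) low + 10.
lower_limits = range(40, 100, 10)

# Frequency: how many grades fall in each class.
freq = {low: sum(low <= g < low + 10 for g in grades) for low in lower_limits}

# Relative frequency: class frequency divided by the number of grades.
rel_freq = {low: count / len(grades) for low, count in freq.items()}

for low in lower_limits:
    print(f"{low} up to {low + 10}: {freq[low]:2d}  {rel_freq[low]:.3f}")
```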

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50
2. 5
3. 17
4. 30

Stem and leaf displays have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem = 10's digit, leaf = 1's digit
For 18: stem = 1, leaf = 8, written 1 | 8

These ages give the display shown above.

Suppose a 95 yr old is hired:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5
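The construction can be sketched in a few lines of Python. The function name `stem_and_leaf` is made up for this illustration, and it assumes non-negative whole numbers with the 10's digit as the stem, as in the employee-age example. Empty stems are kept so that gaps (such as the one before a 95-year-old) stay visible.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one display row per stem, from the smallest to the largest stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)   # stem = 10's digit, leaf = 1's digit
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())        # empty stems print as "7 |"
    return rows

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages + [95])))
```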

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates, n = 138

Count  Stem | Leaves
        4 |
    3   4 | 588
    9   5 | 001233444
   10   5 | 5556788899
   23   6 | 00011111122233333344444
   23   6 | 55556666667777788888888
   16   7 | 00000112222334444
   23   7 | 55555666666777888888999
   10   8 | 0000112224
   10   8 | 5555667789
    4   9 | 0012
    2   9 | 58
    4  10 | 0223
       10 |
    1  11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

[Figure: stem-and-leaf display of the population of 185 US cities with populations between 100,000 and 500,000; multiply stems by 100,000]

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits.

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.).

Heat maps, word walls

[Figure: time plot "Unemployment Rate by Educational Attainment"]

[Figure: time plot "Water Use During Super Bowl XLV (Packers 31, Steelers 25)"]

[Figure: Heat Maps]

[Figure: Word Wall (customer feedback)]

Section 3.2 Describing the Center of Data

Mean
Median

2 characteristics of a data set to measure:

center: measures where the “middle” of the data is located

variability (next section): measures how “spread out” the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$, so $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9.

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \dfrac{\sum_{i=1}^{7} y_i}{7} = \dfrac{112}{7} = 16$
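The same calculation takes a couple of lines in software. A minimal Python sketch using the seven viewing times above (variable names are illustrative):

```python
hours = [19, 40, 16, 12, 10, 6, 9]   # weekly TV viewing times from the slide

n = len(hours)                       # sample size
y_bar = sum(hours) / n               # sample mean: sum of the y_i divided by n

print(n, sum(hours), y_bar)
```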

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$ : population mean

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram "Absences from Work"; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2 4 6 8 10; n = 5; median = 6
Ex: 2 4 6 8; n = 4; median = (4+6)/2 = 5
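The odd/even rule translates directly into code. The sketch below is one way to write it in Python; the function name `median` is chosen here for illustration (Python's standard `statistics.median` applies the same rule).

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)          # arrange in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))   # n = 5, odd
print(median([2, 4, 6, 8]))       # n = 4, even
```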

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area.

Mean: balance point. Median: 50% area each half.
[Figure: histogram with mean 55.26 years, median 57.7 years]

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7:
175 28 32 139 141 253 458
Example, n = 7 (ordered):
28 32 139 141 175 253 458
m = 141

Example, n = 8:
175 28 32 139 141 253 357 458
Example, n = 8 (ordered):
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2 4 6 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2 4 6 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location: 12th obs.; $m = 85$

2010, 2014 baseball salaries

         2010           2014
n        845            848
mean     $3,297,828     $3,932,912
median   $1,330,000     $1,456,250
max      $33,000,000    $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

[Figure: time plot "Baseball Salaries: Mean, Median, and Maximum, 1985-2014"; horizontal axis: year; left vertical axis: mean and median salary, 200,000 to 3,700,000; right vertical axis: maximum salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median
[Figure: histogram "2011 Baseball Salaries"; horizontal axis: salary ($1000s); vertical axis: frequency, 0 to 600; bar labels 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed): mean < median; mean = 78, median = 87
[Figure: "Histogram of Exam Scores"; horizontal axis: exam scores, 20 to 100; vertical axis: frequency, 0 to 30]

Symmetric data: mean and median approximately equal
[Figure: histogram "Bank Customers: 10:00-11:00 am"; horizontal axis: number of customers; vertical axis: frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation
Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the “middle” of the data is located

variability: measures how “spread out” the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$
$y_i - \bar{y}$: deviation of $y_i$ from the mean
$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ : sample standard deviation

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations …

Women's height (inches)

i    x_i   x̄      (x_i - x̄)   (x_i - x̄)^2
1    59    63.4   -4.4        19.0
2    60    63.4   -3.4        11.3
3    61    63.4   -2.4         5.6
4    62    63.4   -1.4         1.8
5    62    63.4   -1.4         1.8
6    63    63.4   -0.4         0.1
7    63    63.4   -0.4         0.1
8    63    63.4   -0.4         0.1
9    64    63.4    0.6         0.4
10   64    63.4    0.6         0.4
11   65    63.4    1.6         2.7
12   66    63.4    2.6         7.0
13   67    63.4    3.6        13.3
14   68    63.4    4.6        21.6
     Mean 63.4    Sum 0.0     Sum 85.2

Mean = 63.4
Sum of squared deviations from mean = 85.2
(n - 1) = 13; (n - 1) is called degrees of freedom (df)
s^2 = variance = 85.2/13 = 6.55 square inches
s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation: $s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: histogram of the heights with Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
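As one possible "other software" route, here is a minimal Python sketch that repeats the hand calculation for the 14 heights step by step. The variable names are illustrative; Python's standard `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean = sum(heights) / n

deviations = [x - mean for x in heights]     # (x_i - x-bar); these sum to 0
ss = sum(d * d for d in deviations)          # sum of squared deviations

variance = ss / (n - 1)                      # divide by degrees of freedom
sd = math.sqrt(variance)                     # standard deviation

print(f"mean={mean:.1f}  SS={ss:.1f}  s^2={variance:.2f}  s={sd:.2f}")
```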

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu$ is the population mean.

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ : population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?
When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$     sample mean
$m$           sample median
$s^2$         sample variance
$s$           sample stand. dev.

POPULATION
$\mu$         population mean
$m$           population median
$\sigma^2$    population variance
$\sigma$      population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)
z-scores

68-95-99.7 rule
Mean and Standard Deviation (numerical) + Histogram (graphical) → 68-95-99.7 rule

The 68-95-99.7 rule
If the histogram of the data is approximately bell-shaped, then
1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$
2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$
3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean
[Figure: bell curve with 68% of the area between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean
[Figure: bell curve with 95% of the area between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)
1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)
2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)
3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
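The interval counts for the textbook data can be checked by machine. This Python sketch is a rough illustration: it computes the mean from the 50 costs but takes s = 42.72 as stated on the slide rather than recomputing it, and the helper name `count_within` is invented here.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = sum(costs) / len(costs)   # sample mean
s = 42.72                         # standard deviation as given on the slide

def count_within(k):
    """Number of costs inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(low < c < high for c in costs)

for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    inside = count_within(k)
    print(f"within {k} sd: {inside}/50 = {inside / 50:.0%}  (rule: {rule})")
```

The observed percentages are close to, but not exactly, the rule's values, which is what "approximately" in the rule allows for.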

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40


Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to $y$: $z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91
Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = .5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100
ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8
Gerald's z-score: z = (27 - 18)/6 = 1.5
Eleanor's score is better.
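Standardizing is a one-line function. A small Python sketch of the comparison above (the function name `z_score` is illustrative):

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math

print(eleanor, gerald)
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.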

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47
                       Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles
5-Number Summary
Interquartile Range: Another Measure of Spread
Boxplots

Example (25 observations, listed in order):

Obs. 1-7:    0.6  1.2  1.6  1.9  1.5  2.1  2.3    (obs. 7 = Q1)
Obs. 8-13:   2.3  2.5  2.8  2.9  3.3  3.4         (obs. 13 = m)
Obs. 14-19:  3.6  3.7  3.8  3.9  4.1  4.2         (obs. 19 = Q3)
Obs. 20-25:  4.5  4.7  4.9  5.3  5.6  6.1

m = median = 3.4
Q1 = first quartile = 2.3
Q3 = third quartile = 4.2

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces:

      Q1      M      Q3
 1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).
Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10
Q1 = 6

Q3: median of upper half 12 14 16 18 20
Q3 = 16
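The rules above can be sketched in Python as follows. The helper names are made up for this illustration. Note that this follows the slide's convention (overall median included in both halves when n is odd); spreadsheet and library quartile functions often use other conventions and may give slightly different answers.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```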

Pulse Rates, n = 138

Count  Stem | Leaves
        4 |
    3   4 | 588
    9   5 | 001233444
   10   5 | 5556788899
   23   6 | 00011111122233333344444
   23   6 | 55556666667777788888888
   16   7 | 00000112222334444
   23   7 | 55555666666777888888999
   10   8 | 0000112224
   10   8 | 5555667789
    4   9 | 0012
    2   9 | 58
    4  10 | 0223
       10 |
    1  11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
 2     22 | 55
 4     23 | 57
 6     24 | 26
 7     25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
 7     31 | 45
 5     32 | 155
 2     33 | 6
 1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR):
IQR = Q3 - Q1
measures spread of middle 50% of the data

Example: beginning pulse rates
Q3 = 78, Q1 = 63
IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
 2     22 | 55
 4     23 | 57
 6     24 | 26
 7     25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
 7     31 | 45
 5     32 | 155
 2     33 | 6
 1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data
45 63 70 78 111

[Figure: ordered listing of the 25 Disease X observations (years until death) with smallest = min = 0.6, Q1 = first quartile = 2.3, m = median = 3.4, Q3 = third quartile = 4.2, largest = max = 6.1]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary
[Figure: boxplot of Disease X; vertical axis: years until death, 0 to 7]

Example: age of 66 “crush” victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

[Figure: ordered listing of the 25 Disease X observations with individual 25 at 7.9; Q1 = first quartile = 2.3, Q3 = third quartile = 4.2, largest = max = 7.9]

Boxplot: display of 5-number summary
[Figure: boxplot of Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9
1.5 × IQR = 1.5 × 1.9 = 2.85
Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
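The 1.5 × IQR check can be sketched in Python. The data values are the 25 Disease X observations with individual 25 at 7.9; decimal points are restored here on the assumption that the slide's values are in years with one decimal place, and Q1 and Q3 are taken from the slide rather than recomputed. Variable names are illustrative.

```python
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

q1, q3 = 2.3, 4.2                 # quartiles as given on the slide
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr      # values below this are flagged
upper_fence = q3 + 1.5 * iqr      # values above this are flagged

outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(f"IQR={iqr:.1f}  upper fence={upper_fence:.2f}  "
      f"outliers={outliers}  whisker top={whisker_top}")
```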

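[Figure: ATM Withdrawals by Day, Month, Holidays]

Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with 40.5, 45, 63, 70, 78, and 100.5 marked]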

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: box plot "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

[Figure: Rock concert deaths, histogram and boxplot]

Automating Boxplot Construction

Excel “out of the box” does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data
Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

[Figure: marginal distribution of class, bar chart]

[Figure: marginal distribution of class, pie chart]

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                     Crew   First   Second   Third   Total
Alive   Count         212     202      118     178     710
        % of col     24.0    62.2     41.4    25.2    32.3
Dead    Count         673     123      167     528    1491
        % of col     76.0    37.8     58.6    74.8    67.7
Total   Count         885     325      285     706    2201

[Figure: conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the table above):
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325
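The marginal and conditional percentages all come from the same table of counts; only the denominator changes. A minimal Python sketch (the dictionary layout and variable names are choices made for this illustration):

```python
table = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in table.items()}
col_totals = {c: sum(table[status][c] for status in table) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: table["Alive"][c] / col_totals[c] for c in classes}

# The three questions about first class differ only in the denominator.
first_among_survivors = table["Alive"]["First"] / row_totals["Alive"]   # 202/710
first_and_survived = table["Alive"]["First"] / grand_total              # 202/2201
survived_among_first = table["Alive"]["First"] / col_totals["First"]    # 202/325

for c in classes:
    print(f"{c:<7} survived: {survival_given_class[c]:.1%}")
```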

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277


Student   Beers   Blood Alcohol
1         5       0.10
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:
1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

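Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$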
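The formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot example. The sketch below assumes the pairs are read in thousands of pounds and gallons per 100 miles; the function name `correlation` is illustrative only.

```python
from statistics import mean, stdev

# The ten (car weight, fuel consumption) pairs from the scatterplot example.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n-1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

The result agrees with the value r = .9766 reported for this data set.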

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player
y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Section 3.1 Displaying Categorical Data

“Sometimes you can see a lot just by looking.”

Yogi Berra

Hall of Fame Catcher, NY Yankees

The three rules of data analysis won’t be difficult to remember

1. Make a picture: reveals aspects not obvious in the raw data; enables you to think clearly about the patterns and relationships that may be hiding in your data.

2. Make a picture: to show important features of and patterns in the data. You may also see things that you did not expect: the extraordinary (possibly wrong) data values or unexpected patterns.

3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Bar Charts: show counts or relative frequency for each category

Example: Titanic passenger/crew distribution

[Figure: bar chart "Titanic Passengers by Class"; Crew 885, First 325, Second 285, Third 706; vertical axis 0 to 1000]

Pie Charts: show proportions of the whole in each category

Example: Titanic passenger/crew distribution

[Figure: pie chart "Titanic Passengers by Class"; Crew 40%, First 15%, Second 13%, Third 32%]
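The pie chart percentages are the bar chart counts divided by the total. A minimal sketch of that conversion, using the four class counts from the bar chart (the dictionary layout is just one convenient choice):

```python
# Counts from the "Titanic Passengers by Class" bar chart.
counts = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

total = sum(counts.values())

# Relative frequency of each category, expressed as a whole-number percent.
percents = {category: round(100 * count / total) for category, count in counts.items()}

for category, pct in percents.items():
    print(f"{category}: {counts[category]} of {total} = {pct}%")
```

The rounded percents reproduce the pie chart labels and add up to 100, which is the check the later slide recommends for any pie chart.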

Example: Top 10 causes of death in the United States

Rank  Causes of death        Counts    % of top 10s   % of total deaths
 1    Heart disease          700,142   37%            28%
 2    Cancer                 553,768   29%            22%
 3    Cerebrovascular        163,538    9%             6%
 4    Chronic respiratory    123,013    6%             5%
 5    Accidents              101,537    5%             4%
 6    Diabetes mellitus       71,372    4%             3%
 7    Flu and pneumonia       62,034    3%             2%
 8    Alzheimer’s disease     53,852    3%             2%
 9    Kidney disorders        39,480    2%             2%
10    Septicemia              32,238    2%             1%
      All other causes       629,967                  25%

For each individual who died in the United States, we record what was the cause of death. The table above is a summary of that information.

[Figure: bar graph "Top 10 causes of deaths in the United States"; vertical axis Counts (x1000), 0 to 800]

Top 10 causes of death: bar graph

Each category is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Figure: bar graph "Top 10 causes of deaths in the United States", sorted by rank. Easy to analyze.]

[Figure: the same bar graph sorted alphabetically. Much less useful.]

 1 United States $158
 2 China $64.4
 3 Japan $54
 4 Germany $24.4
 5 Britain $23.5
 6 France $19.3
 7 Brazil $14.2
 8 Italy $13.1
 9 Australia $12.8
10 India $11.9

 1 United States $137.9
 2 Japan $23.4
 3 Germany $20
 4 Britain $16.8
 5 France $12.6
 6 Canada $7.3
 7 Italy $6.3
 8 China $5.4
 9 Netherlands $5.4
10 Australia $4.8

Recent Annual Software Sales ($ billions)
Recent Annual Computer Hardware Sales ($ billions)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figures: pie charts "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state (New Hampshire through California), 2009-10 vs 2012-13; vertical axis 0 to 40,000. National average: 2009-10 $21,604; 2012-13 $25,043]

Student Debt: North Carolina Schools

[Figure: bar chart "North Carolina Private Schools", one pair of bars per school; series Tuition and fees (in-state) and Average debt of graduates; horizontal axis 0 to 60,000]

[Figure: bar chart "North Carolina Public Schools" (UNC Greensboro, UNC School of the Arts, NC A&T, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City, and the North Carolina mean for 4-year or above); series Tuition and fees (in-state) and Average debt of graduates; horizontal axis 0 to 25,000]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION"; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; frequency axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis Grade, 40 to 100; vertical axis Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with the same center but different spread]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

[Figure: symmetric distribution]

[Figure: complex, multimodal distribution]

Not all distributions have a simple overall shape, especially when there are few observations.

[Figure: skewed distribution]

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont): Female heart attack patients in New York state

[Figures: histogram of Age (left-skewed); histogram of Cost (right-skewed)]

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx 700)"; horizontal axis TIMES, 19.2 to 20.16; vertical axis Frequency, 0 to 60. Labeled outliers: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled as outliers]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis Bin; vertical axis Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50       2
50 up to 60       6
60 up to 70       8
70 up to 80       7
80 up to 90       5
90 up to 100      2
Total            30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
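The frequency and relative frequency tables can be produced by counting how many grades fall in each class. The sketch below assumes, as the "up to" wording suggests, that each class includes its lower limit and excludes its upper limit.

```python
# The 30 exam grades from the example.
grades = [
    75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
    58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
    82, 48, 67, 55,
]

class_width = 10
lower_limits = list(range(40, 100, class_width))  # 40, 50, ..., 90

# Frequency: number of grades g with lower <= g < lower + width.
frequencies = [
    sum(lower <= g < lower + class_width for g in grades) for lower in lower_limits
]
# Relative frequency: frequency divided by the number of observations.
relative = [f / len(grades) for f in frequencies]

for lower, f, rf in zip(lower_limits, frequencies, relative):
    print(f"{lower} up to {lower + class_width}: {f:2d}   {rf:.3f}")
print(f"Total: {sum(frequencies)}")
```

Plotting `relative` against the class limits gives the relative frequency histogram of grades.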

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis Grade, 40 to 100; vertical axis Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1) 50
2) 5
3) 17
4) 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10’s digit; leaf = 1’s digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
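The construction can be sketched in a few lines: sort the values, split each into a stem (tens digit) and a leaf (ones digit), and list the leaves for each stem in ascending order. The helper name `stem_and_leaf` is illustrative; empty stems are kept so that gaps in the data stay visible.

```python
from collections import defaultdict

# Employee ages from the example.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]


def stem_and_leaf(values):
    """Return display rows with stem = tens digit(s) and leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in ascending order
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):  # include empty stems
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


rows = stem_and_leaf(ages)
print("\n".join(rows))
```

Calling `stem_and_leaf(ages + [95])` adds empty rows for stems 7 and 8 before the row `9 | 5`, which is how an unusually large value shows up as a gap in the display.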

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10’s digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count   Stem | Leaves
          4  |
  3       4  | 588
  9       5  | 001233444
 10       5  | 5556788899
 23       6  | 00011111122233333344444
 23       6  | 55556666667777788888888
 16       7  | 00000112222334444
 23       7  | 55555666666777888888999
 10       8  | 0000112224
 10       8  | 5555667789
  4       9  | 0012
  2       9  | 58
  4      10  | 0223
         10  |
  1      11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10rsquos digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data Time plots

plot observations in time order time on horizontal axis variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment quarterly GDP weather records electricity demand etc)

Heat maps word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the “middle” of the data is located

variability (next section)

measures how “spread out” the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram "Absences from Work"; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
       = mean of 2 middle values (n even)

Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
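The median rule (middle value when n is odd, mean of the two middle values when n is even) can be written as a short function. This is a minimal sketch; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data (n odd) or mean of the 2 middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# The 62 student pulse rates from the example.
pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70,
    71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79,
    80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98,
    98, 103,
]

print(median([2, 4, 6, 8, 10]))  # n odd
print(median([2, 4, 6, 8]))      # n even
print(median(pulses))            # n = 62, mean of the 31st and 32nd values
```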

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: time plot "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis Year, 1985 to 2013; left vertical axis Mean, Median Salary, 200,000 to 3,700,000; right vertical axis Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the “middle” of the data is located

variability

measures how “spread out” the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

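Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$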

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women height (inches)

 i    x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1    59    63.4   -4.4           19.0
 2    60    63.4   -3.4           11.3
 3    61    63.4   -2.4            5.6
 4    62    63.4   -1.4            1.8
 5    62    63.4   -1.4            1.8
 6    63    63.4   -0.4            0.1
 7    63    63.4   -0.4            0.1
 8    63    63.4   -0.4            0.1
 9    64    63.4    0.6            0.4
10    64    63.4    0.6            0.4
11    65    63.4    1.6            2.7
12    66    63.4    2.6            7.0
13    67    63.4    3.6           13.3
14    68    63.4    4.6           21.6

Mean 63.4   Sum 0.0   Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: distribution of the heights with the interval Mean ± 1 s.d. marked]

We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
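As one example of "other software", the two steps can be carried out in Python on the 14 heights from the table. This is a sketch only; the manual steps are shown next to the library call `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import mean, stdev

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = mean(heights)

# Deviations from the mean always sum to zero (up to rounding error).
deviation_sum = sum(x - x_bar for x in heights)

# Step 1: variance = sum of squared deviations divided by (n - 1).
sum_sq = sum((x - x_bar) ** 2 for x in heights)
variance = sum_sq / (n - 1)

# Step 2: standard deviation = square root of the variance.
s = sqrt(variance)

print(f"mean = {x_bar:.1f}")
print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {variance:.2f} square inches")
print(f"standard deviation = {s:.2f} inches (library: {stdev(heights):.2f})")
```

The slide rounds the mean to 63.4 before squaring, so intermediate values can differ slightly in later decimal places from an unrounded computation.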

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION

$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) and Histogram (graphical), connected by the 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

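68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$ (34% on each side of $\bar{y}$)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$ (47.5% on each side of $\bar{y}$)]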

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
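The comparison in the textbook cost example can be automated: compute the mean and sample standard deviation, then count how many values fall within 1, 2 and 3 standard deviations of the mean. A minimal sketch using the 50 costs from the example:

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)  # sample standard deviation (divides by n - 1)

counts = []
for k, rule in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low < c < high for c in costs)  # values inside the interval
    counts.append(inside)
    pct = 100 * inside / len(costs)
    print(f"within {k} s.d. ({low:.2f}, {high:.2f}): {inside}/{len(costs)} = {pct:.0f}%  (rule: {rule})")
```

The observed percentages are close to, but not exactly, the percentages in the rule, which is expected because the rule is only approximate for real data.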

The best estimate of the standard deviation of the men’s weights displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor’s score 680; SAT mean = 500, sd = 100

ACT Math: Gerald’s score 27; ACT mean = 18, sd = 6

Eleanor’s z-score: z = (680-500)/100 = 1.8

Gerald’s z-score: z = (27-18)/6 = 1.5

Eleanor’s score is better.
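Both comparisons (the two exams, and SAT vs ACT) use the same calculation, so a small helper makes the pattern explicit. The function name `z_score` is illustrative only.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam comparison: same mean, different spread.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT comparison: different scales made comparable by standardizing.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(f"exam 1: z = {z_exam1:.1f}, exam 2: z = {z_exam2:.1f}")
print(f"Eleanor (SAT): z = {z_eleanor:.1f}, Gerald (ACT): z = {z_gerald:.1f}")
```

In each pair the larger z-score identifies the better relative performance, even when the raw score is smaller.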

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697     Sum = 0    Sum = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC’s z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

Data (n = 25): 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1      M      Q3
 1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
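The quartile rules above can be sketched in Python as follows. The function names `median` and `quartiles` are hypothetical helpers written for this illustration. Note that statistical software often uses slightly different quartile conventions, so library routines may not reproduce these values exactly.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: include the overall median in both halves
        lower, upper = data[: half + 1], data[half:]
    else:
        # n even: the overall median belongs to neither half
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print("Q1 =", q1, "median =", m, "Q3 =", q3)
```

For the ten-value example this gives Q1 = 6, median = 11 and Q3 = 16, matching the hand calculation.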

Pulse Rates n = 138

Count  Stem  Leaves
3      4     588
9      5     001233444
10     5     5556788899
23     6     00011111122233333344444
23     6     55556666667777788888888
16     7     00000112222334444
23     7     55555666666777888888999
10     8     0000112224
10     8     5555667789
4      9     0012
2      9     58
4      10    0223
       10
1      11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
2     22 | 55
4     23 | 57
6     24 | 26
7     25 | 7
10    26 | 257
12    27 | 59
(4)   28 | 1567
15    29 | 35599
10    30 | 333
7     31 | 45
5     32 | 155
2     33 | 6
1     34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
2     22 | 55
4     23 | 57
6     24 | 26
7     25 | 7
10    26 | 257
12    27 | 59
(4)   28 | 1567
15    29 | 35599
10    30 | 333
7     31 | 45
5     32 | 155
2     33 | 6
1     34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
9  5  2.5
8  6  2.3
7  7  2.3
6  6  2.1
5  5  1.5
4  4  1.9
3  3  1.6
2  2  1.2
1  1  0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: Disease X; vertical axis: years until death, 0 to 7]

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
9  5  2.5
8  6  2.3
7  7  2.3
6  6  2.1
5  5  1.5
4  4  1.9
3  3  1.6
2  2  1.2
1  1  0.6

Largest = max = 7.9

Boxplot display of 5-number summary

BOXPLOT

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates; labeled values: 40.5, 45, 63, 70, 78, 100.5]
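The 1.5 × IQR fences used for the pulse data can be computed as in the sketch below. The function names are hypothetical, and the short list of pulses is only a handful of values read from the stem-and-leaf display (including the five largest), not the full data set of 138.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Values lying below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Quartiles of the beginning-of-class pulse rates, from the slide
low, high = outlier_fences(63, 78)
print("fences:", low, high)

# A few pulses from the display, including the largest ones
some_pulses = [45, 63, 70, 78, 100, 102, 102, 103, 111]
print("outliers:", find_outliers(some_pulses, 63, 78))
```

Any value outside the fences is plotted individually in a boxplot, and each whisker stops at the most extreme data value that is still inside its fence.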

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
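The marginal and conditional distributions for the Titanic table can be computed directly from the cell counts. The sketch below is one way to do it with plain dictionaries; all variable names are illustrative.

```python
# Cell counts from the Titanic contingency table
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
class_marginal = {c: class_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: cell / column total
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

# The three questions from the slide
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())
first_and_survived = counts["Alive"]["First"] / grand_total
survived_among_first = counts["Alive"]["First"] / class_totals["First"]

print(f"survivors in first class:     {first_among_survivors:.1%}")
print(f"first class and survived:     {first_and_survived:.1%}")
print(f"first class who survived:     {survived_among_first:.1%}")
```

The three questions use the same numerator (202) but different denominators, which is exactly the difference between conditioning on survival, on nothing, and on class.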

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

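Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$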
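The correlation formula can be applied to the ten (car weight, fuel consumption) pairs from the scatterplot slide. The sketch below follows the formula term by term: it standardizes each x and each y, multiplies the pairs, and divides the sum by n - 1. The function name `correlation` is a hypothetical helper.

```python
import statistics


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample sd (n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# (weight in 1000 lbs, fuel consumption in gal/100 miles) from the slide
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

The result agrees with the value r = .9766 reported on the slide: heavier cars in this sample tend to use more fuel, and the points lie close to a straight line.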

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


The three rules of data analysis won't be difficult to remember:

1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.

2. Make a picture: to show important features of and patterns in the data. You may also see things that you did not expect: the extraordinary (possibly wrong) data values or unexpected patterns.

3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Bar Charts: show counts or relative frequency for each category

Example: Titanic passenger/crew distribution

[Figure: bar chart "Titanic Passengers by Class"; bars: Crew 885, First 325, Second 285, Third 706; vertical axis 0 to 1000]

Pie Charts: show proportions of the whole in each category

Example: Titanic passenger/crew distribution

[Figure: pie chart "Titanic Passengers by Class"; slices: Crew 40%, First 15%, Second 13%, Third 32%]

Example: Top 10 causes of death in the United States

Rank  Causes of death        Counts    % of top 10s   % of total deaths
1     Heart disease          700,142   37%            28%
2     Cancer                 553,768   29%            22%
3     Cerebrovascular        163,538   9%             6%
4     Chronic respiratory    123,013   6%             5%
5     Accidents              101,537   5%             4%
6     Diabetes mellitus      71,372    4%             3%
7     Flu and pneumonia      62,034    3%             2%
8     Alzheimer's disease    53,852    3%             2%
9     Kidney disorders       39,480    2%             2%
10    Septicemia             32,238    2%             1%
      All other causes       629,967                  25%

For each individual who died in the United States, we record what was the cause of death. The table above is a summary of that information.

[Figure: bar graph "Top 10 causes of deaths in the United States"; vertical axis: Counts (x1000), 0 to 800]

Top 10 causes of death: bar graph

Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Figure: the same bar graph with bars sorted by rank; vertical axis: Counts (x1000), 0 to 800]

Bar graph sorted by rank: Easy to analyze

[Figure: the same bar graph with bars sorted alphabetically; vertical axis: Counts (x1000), 0 to 800]

Sorted alphabetically: Much less useful

1 United States $158
2 China $64.4
3 Japan $54
4 Germany $24.4
5 Britain $23.5
6 France $19.3
7 Brazil $14.2
8 Italy $13.1
9 Australia $12.8
10 India $11.9

1 United States $137.9
2 Japan $23.4
3 Germany $20
4 Britain $16.8
5 France $12.6
6 Canada $7.3
7 Italy $6.3
8 China $5.4
9 Netherlands $5.4
10 Australia $4.8

Recent Annual Software Sales ($ billions); Recent Annual Computer Hardware Sales ($ billions)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

Percent of deaths from top 10 causes

Percent of deaths from all causes

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of student debt by state, 2009-10 vs 2012-13; vertical axis 0 to 40,000; states shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National Average: 2009-10 $21,604; 2012-13 $25,043

[Figure: "North Carolina Private Schools"; horizontal bars for tuition and fees (in-state) and average debt of graduates; axis 0 to 60,000; schools listed: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Figure: "North Carolina Public Schools"; horizontal bars for tuition and fees (in-state) and average debt of graduates; axis 0 to 25,000; schools listed: UNC Greensboro, UNC School of the Arts, NC A & T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Student Debt: North Carolina Schools

Unnecessary dimension in a pie chart

3rd dimension is unnecessary the 3D pie chart does not convey any more information than a 2D pie chart

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION"; bins 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two histograms with different centers; bins 0<2 through 16<18; vertical axis 0 to 70]

Histograms - Same Center, Different Spread

[Figure: two histograms with the same center and different spread; bins 0<2 through 16<18; vertical axis 0 to 70]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Symmetric distribution

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx 700)"; horizontal axis: TIMES, 19.2 to 20.16; vertical axis: Frequency, 0 to 60; labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of salaries; horizontal axis: Bin; vertical axis: Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
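The frequency and relative frequency tables can be rebuilt from the 30 exam grades. The sketch below assumes that "40 up to 50" means 40 is included and 50 is not, which is the reading that reproduces the counts in the table; variable names are illustrative.

```python
# The 30 exam grades from the example
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

class_limits = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

# Count the grades falling in each class (lower limit included, upper excluded)
frequency = {
    f"{low} up to {high}": sum(1 for g in grades if low <= g < high)
    for low, high in class_limits
}

n = len(grades)
relative_frequency = {label: count / n for label, count in frequency.items()}

for label in frequency:
    print(f"{label:<13} {frequency[label]:>2}   {relative_frequency[label]:.3f}")
print(f"{'Total':<13} {sum(frequency.values()):>2}")
```

The relative frequencies are the heights of the bars in the relative frequency histogram, and they sum to 1.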

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4
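A stem-and-leaf display like the one above can be generated with a short script. This is a minimal sketch for two-digit whole numbers, with the 10's digit as stem and the 1's digit as leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display as a list of 'stem | leaves' strings (stem = 10's digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)  # split into stem and leaf
    lines = []
    # Include empty stems so gaps in the data stay visible
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
for line in stem_and_leaf(ages):
    print(line)
```

Adding a 95 to the list produces empty rows for stems 7 and 8 followed by `9 | 5`, which is how the display makes an unusual value stand out.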

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem leaf

4 03

3 247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Count  Stem  Leaves
3      4     588
9      5     001233444
10     5     5556788899
23     6     00011111122233333344444
23     6     55556666667777788888888
16     7     00000112222334444
23     7     55555666666777888888999
10     8     0000112224
10     8     5555667789
4      9     0012
2      9     58
4      10    0223
       10
1      11    1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$, so $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$
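The same calculation in Python, as a small sketch with illustrative variable names: add up the observations and divide by the sample size.

```python
# Weekly TV viewing time (hours) for the 7 sampled 4th graders
tv_hours = [19, 40, 16, 12, 10, 6, 9]

n = len(tv_hours)      # sample size
total = sum(tv_hours)  # sum of the observations
y_bar = total / n      # sample mean

print(f"n = {n}, sum = {total}, mean = {y_bar}")
```

The standard library function `statistics.mean(tv_hours)` returns the same value.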

Population Mean

Denoted by the Greek letter $\mu$:

population mean $\mu = \frac{\sum_{i=1}^{N} y_i}{N}$

$N$ is the population size (for example, $N$ = 34000 for NCSU).

The value of $\mu$ is typically not known.

We often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Figure: histogram "Absences from Work"; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd); mean of 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
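The definition of the median translates directly into code. The function below is a hypothetical helper written for illustration: it sorts the data, then returns the middle value when n is odd or the mean of the two middle values when n is even.

```python
def median(values):
    """Middle value (n odd) or mean of the 2 middle values (n even)."""
    data = sorted(values)  # arrange in order of magnitude
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


print(median([2, 4, 6, 8, 10]))  # n = 5
print(median([2, 4, 6, 8]))      # n = 4
```

Python's standard library provides `statistics.median`, which follows the same rule.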

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7

17.5 2.8 3.2 13.9 14.1 25.3 45.8

Example: n = 7 (ordered)

2.8 3.2 13.9 14.1 17.5 25.3 45.8

m = 14.1

Example: n = 8

17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8

Example: n = 8 (ordered)

2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

m = (14.1+17.5)/2 = 15.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location: 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis: Year, 1985 to 2013; left vertical axis: Mean, Median Salary, 200,000 to 3,700,000; right vertical axis: Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar labels: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

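Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$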

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

i     xi    xbar    (xi - xbar)   (xi - xbar)²
1     59    63.4    -4.4          19.0
2     60    63.4    -3.4          11.3
3     61    63.4    -2.4          5.6
4     62    63.4    -1.4          1.8
5     62    63.4    -1.4          1.8
6     63    63.4    -0.4          0.1
7     63    63.4    -0.4          0.1
8     63    63.4    -0.4          0.1
9     64    63.4    0.6           0.4
10    64    63.4    0.6           0.4
11    65    63.4    1.6           2.7
12    66    63.4    2.6           7.0
13    67    63.4    3.6           13.3
14    68    63.4    4.6           21.6

Mean = 63.4; sum of (xi - xbar) = 0.0; sum of (xi - xbar)² = 85.2

x


$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 sd

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
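As one illustration of letting software do the work, here is a minimal Python sketch that reproduces the women's height calculation. Python and its standard `statistics` module are an assumption here (the slides mention only calculators, Excel, and other software); the variable names are made up for the example.

```python
from statistics import mean, stdev

# Heights (inches) of the 14 women from the table above
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
y_bar = mean(heights)                             # sample mean
sum_sq = sum((y - y_bar) ** 2 for y in heights)   # sum of squared deviations
s2 = sum_sq / (n - 1)                             # sample variance: divide by n - 1
s = s2 ** 0.5                                     # sample standard deviation

print(f"mean = {y_bar:.1f}")
print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {s2:.2f} square inches")
print(f"standard deviation = {s:.2f} inches")

# The library function uses the same n - 1 divisor
print(f"statistics.stdev gives {stdev(heights):.2f}")
```

The step-by-step version mirrors the two steps on the slide (variance first, then square root); `stdev` does both at once.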

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N = 34000$ for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION
$\mu$        population mean
             population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

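68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]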

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
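The three interval checks for the textbook costs can be automated. The following is a small Python sketch (Python is an assumption, not the software used in the slides; `count_within` is a hypothetical helper name) that counts how many of the 50 costs fall within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)   # sample mean
s = stdev(costs)      # sample standard deviation (n - 1 divisor)

def count_within(k):
    """Count the data values inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(1 for c in costs if low < c < high)

for k in (1, 2, 3):
    inside = count_within(k)
    print(f"within {k} sd: {inside}/{len(costs)} = {inside / len(costs):.0%}")
```

Comparing the printed percentages with 68%, 95%, and 99.7% is a quick way to judge how close a data set is to bell-shaped.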

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0
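The claim that z-scores add to zero can be checked numerically. This is a minimal Python sketch under the assumption that Python's standard `statistics` module is available; the dictionary and variable names are illustrative only.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools, from the table above
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)   # sample mean
s = stdev(values)      # sample standard deviation

# z = (y - y_bar) / s for every school
z = {school: (y - y_bar) / s for school, y in support.items()}

for school, score in z.items():
    print(f"{school:<11} {score:6.2f}")

# Deviations from the mean sum to zero, so the z-scores do too
# (up to floating-point rounding)
print(f"sum of z-scores = {sum(z.values()):.10f}")
```

Because each z-score is a deviation divided by the same constant $s$, the zero sum follows directly from the fact that deviations from the mean always sum to zero.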

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data listed by position 1 to 25 (Q1 is in position 7, the median in position 13, Q3 in position 19):

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Diagram: Q1, M, and Q3 split the sorted data into four pieces, each holding 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
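The quartile rules above translate directly into a few lines of code. The sketch below is one possible Python version (the function name `quartiles` is made up for this illustration). Note that statistical packages use several different quartile conventions, so built-in functions in other software may return slightly different values than this rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides:
    n odd  -> the overall median is included in both halves
    n even -> the overall median is included in neither half
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        lower = data[:half + 1]   # includes the middle value
        upper = data[half:]       # includes the middle value
    else:
        lower = data[:half]
        upper = data[half:]
    return median(lower), median(data), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```

Sorting inside the function means the data do not have to be entered in order.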

Pulse Rates (n = 138)

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45 63 70 78 111

Largest = max = 6.1
Q3 = third quartile = 4.2
m = median = 3.4
Q1 = first quartile = 2.3
Smallest = min = 0.6

Data listed from position 25 down to position 1:

6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4
3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot display of 5-number summary

[Figure: BOXPLOT of the Disease X data; vertical axis: Years until death (0 to 7); the box and whiskers mark the five-number summary: min, Q1, m, Q3, max]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data listed from position 25 down to position 1:

7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4
3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot display of 5-number summary

[Figure: BOXPLOT of the Disease X data with an outlier; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data marking 40.5, 45, 63, 70, 78, and 100.5]
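The 1.5 × IQR rule used in the two boxplot examples can be written as a short routine. This is a hedged Python sketch, not the method of any particular package; `fences` and `outliers` are hypothetical helper names, and the quartiles are passed in as already computed on the slides.

```python
def fences(q1, q3):
    """Return the lower and upper outlier fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Pulse data: Q1 = 63, Q3 = 78
print(fences(63, 78))

# Disease X data (years until death) with the 7.9 value; Q1 = 2.3, Q3 = 4.2
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
low, high = fences(2.3, 4.2)
flagged = outliers(years, 2.3, 4.2)

# The upper whisker stops at the largest value that is not beyond the fence
whisker_top = max(v for v in years if v <= high)
print("outliers:", flagged, " upper whisker ends at:", whisker_top)
```

The whisker calculation matches the slide's description: the line from the top of the box is drawn to the biggest data value below the upper fence.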

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
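The marginal distributions are just row and column totals divided by the grand total. The following Python sketch (an illustration only; the nested-dictionary layout is an assumption made for this example) recomputes them from the Titanic counts.

```python
# Titanic counts: rows are survival status, columns are class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
classes = list(counts["Alive"])
class_marginal = {
    c: sum(counts[status][c] for status in counts) / grand_total for c in classes
}

for status, share in survival_marginal.items():
    print(f"{status}: {share:.1%}")
for c, share in class_marginal.items():
    print(f"{c}: {share:.1%}")
```

Each marginal distribution should add up to 100%, which is a useful check on the arithmetic.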

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                                  Class
Survival               Crew   First   Second   Third   Total
Alive   Count           212     202      118     178     710
        % of col       24.0    62.2     41.4    25.2    32.3
Dead    Count           673     123      167     528    1491
        % of col       76.0    37.8     58.6    74.8    67.7
Total   Count           885     325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                                  Class
Survival               Crew   First   Second   Third   Total
Alive   Count           212     202      118     178     710
        % of col       24.0    62.2     41.4    25.2    32.3
Dead    Count           673     123      167     528    1491
        % of col       76.0    37.8     58.6    74.8    67.7
Total   Count           885     325      285     706    2201
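The three questions differ only in which total goes in the denominator. A small Python sketch makes this explicit (the variable names are invented for this illustration):

```python
# Titanic counts by class
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

col_totals = {c: alive[c] + dead[c] for c in alive}   # passengers per class
total_alive = sum(alive.values())                     # all survivors
grand_total = sum(col_totals.values())                # everyone on board

# Conditional distribution of survival given class ("% of col" in the table)
survival_given_class = {c: alive[c] / col_totals[c] for c in alive}

# The three questions, all built from the same cell count of 202
frac_survivors_in_first = alive["First"] / total_alive      # 202/710
frac_first_and_survived = alive["First"] / grand_total      # 202/2201
frac_first_who_survived = alive["First"] / col_totals["First"]  # 202/325

for c, share in survival_given_class.items():
    print(f"survived, given {c}: {share:.1%}")
print(f"survivors who were in first class: {frac_survivors_in_first:.1%}")
print(f"first class and survived:          {frac_first_and_survived:.1%}")
print(f"first class who survived:          {frac_first_who_survived:.1%}")
```

Dividing by a row total, the grand total, or a column total answers three different questions even though the numerator is the same.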

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

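Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$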
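The correlation formula is the average (with divisor n - 1) of the products of the x and y z-scores. The sketch below is a direct Python translation for the fuel consumption data; Python and the function name `correlation` are assumptions for this illustration, not part of the original slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in pounds instead of thousands of pounds.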

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):
(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Bar Charts: show counts or relative frequency for each category

Example: Titanic passenger/crew distribution

[Figure: bar chart, Titanic Passengers by Class; Crew 885, First 325, Second 285, Third 706; vertical axis 0 to 1000]

Pie Charts: show proportions of the whole in each category

Example: Titanic passenger/crew distribution

[Figure: pie chart, Titanic Passengers by Class; Crew 40%, First 15%, Second 13%, Third 32%]

Example: Top 10 causes of death in the United States

Rank  Causes of death         Counts   % of top 10s   % of total deaths
  1   Heart disease           700142        37               28
  2   Cancer                  553768        29               22
  3   Cerebrovascular         163538         9                6
  4   Chronic respiratory     123013         6                5
  5   Accidents               101537         5                4
  6   Diabetes mellitus        71372         4                3
  7   Flu and pneumonia        62034         3                2
  8   Alzheimer's disease      53852         3                2
  9   Kidney disorders         39480         2                2
 10   Septicemia               32238         2                1
      All other causes        629967                         25

For each individual who died in the United States, we record the cause of death. The table above is a summary of that information.

Top 10 causes of death: bar graph

Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

[Figure: bar graph, Top 10 causes of deaths in the United States; vertical axis: Counts (x1000), 0 to 800]

The number of individuals who died of an accident is approximately 100000.

[Figure: bar graph sorted by rank (vertical axis: Counts (x1000), 0 to 800): easy to analyze]

[Figure: the same bar graph sorted alphabetically: much less useful]

1 United States $1582 China $6443 Japan $544 Germany $2445 Britain $2356 France $1937 Brazil $1428 Italy $1319 Australia $12810 India $119

1 United States $13792 Japan $2343 Germany $204 Britain $1685 France $1266 Canada $737 Italy $638 China $54 9 Netherlands $5410 Australia $48

Recent Annual Software Sales ($billions)Recent Annual Computer Hardware Sales ($billion)

NY Times

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figure: pie chart, Percent of people dying from top 10 causes of death in the United States]

[Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state for 2009-10 and 2012-13; vertical axis 0 to 40000; states shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National Average: 2009-10 $21604; 2012-13 $25043

Student Debt: North Carolina Schools

[Figure: bar chart, North Carolina Private Schools; bars show tuition and fees (in-state) and average debt of graduates; horizontal axis 0 to 60000; schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Figure: bar chart, North Carolina Public Schools; bars show tuition and fees (in-state) and average debt of graduates; horizontal axis 0 to 25000; schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Unnecessary dimension in a pie chart

3rd dimension is unnecessary the 3D pie chart does not convey any more information than a 2D pie chart

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram, BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two histograms over classes 0<2 through 16<18 (vertical axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two histograms over classes 0<2 through 16<18 (vertical axis 0 to 70) with the same center but different spread]

Histograms Shape

A distribution is symmetric if the right and left

sides of the histogram are approximately mirror

images of each other

Symmetric distribution

Complex multimodal distribution

Not all distributions have a simple overall shape

especially when there are few observations

Skewed distribution

A distribution is skewed to the right if the right

side of the histogram (side with larger values)

extends much farther out than the left side It is

skewed to the left if the left side of the histogram

extends much farther out than the right side

Shape (cont.): Female heart attack patients in New York state

[Figure: two histograms; Age is left-skewed, Cost is right-skewed]

Shape (cont.): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram, 200 m Races 20.2 secs or less (approx. 700); horizontal axis: TIMES (19.2 to 20.16); vertical axis: Frequency (0 to 60); labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont.): Outliers

[Figure: histogram with Alaska and Florida labeled]

An important kind of deviation is an outlier Outliers are observations

that lie outside the overall pattern of a distribution Always look for

outliers and try to explain them

The overall pattern is fairly

symmetrical except for 2

states clearly not belonging

to the main trend Alaska

and Florida have unusual

representation of the

elderly in their population

A large gap in the

distribution is typically a

sign of an outlier

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis: Bin; vertical axis: Frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
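The frequency and relative frequency tables can be produced by sorting each grade into its class. Here is a minimal Python sketch (an illustration under the assumption that classes have width 10 and include their lower limit, as in "40 up to 50"; the variable names are invented):

```python
# The 30 exam grades from the example
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

lower_limits = list(range(40, 100, 10))   # 40, 50, ..., 90
freq = {lo: 0 for lo in lower_limits}

for g in grades:
    lo = (g // 10) * 10    # class "lo up to lo + 10" contains g
    freq[lo] += 1

n = len(grades)
rel_freq = {lo: count / n for lo, count in freq.items()}

for lo in lower_limits:
    print(f"{lo} up to {lo + 10}: frequency {freq[lo]:2d}, relative frequency {rel_freq[lo]:.3f}")
print(f"Total: {sum(freq.values())}")
```

The relative frequencies are the heights of the bars in the relative frequency histogram.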

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50
2. 5
3. 17
4. 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4
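A stem-and-leaf display like the one above can be generated by splitting each value into its 10's digit and 1's digit. The following Python sketch is one way to do it (`stem_and_leaf` is a hypothetical helper name, and it assumes two-digit whole-number data as in this example):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by stem (10's digit); leaves are the 1's digits."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return dict(rows)

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)

# Print every stem from smallest to largest, including stems with no leaves
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing the empty stems matters: if a 95-year-old were added, stems 7 and 8 would appear with no leaves, which makes the gap in the data visible.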

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem leaf

4 03
3 247
2 6677789
2 01222233444
1 13467889
0 8

Pulse Rates (n = 138)

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order; time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{19 + 40 + 16 + 12 + 10 + 6 + 9}{7} = \frac{112}{7} = 16$$
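As a quick check of the arithmetic in this example, the sample mean can be computed directly from its definition (sum of the observations divided by the sample size). The variable names are chosen for this sketch only.

```python
# Weekly TV viewing time (hours) of the 7 fourth graders in the example.
tv_hours = [19, 40, 16, 12, 10, 6, 9]

n = len(tv_hours)            # sample size
total = sum(tv_hours)        # sum of the observations
y_bar = total / n            # sample mean

print(f"n = {n}, sum = {total}, mean = {y_bar}")
```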

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean = 140.6

[Figure: frequency histogram. Horizontal axis: Absences from Work, classes from 118.5 to 160.5 and "More". Vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
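The definition above translates directly into code. The following is a minimal sketch (the function name `median` is chosen for this illustration; Python's standard library also offers `statistics.median`), applied to the two small examples.

```python
def median(values):
    """Median by the definition above: sort, then take the middle value
    (n odd) or the mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6, 8, 10]))  # n = 5, middle value
print(median([2, 4, 6, 8]))      # n = 4, mean of the two middle values
```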

Student Pulse Rates (n = 62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 17.5 2.8 3.2 13.9 14.1 25.3 45.8

Example, n = 7 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 45.8

m = 14.1

Example, n = 8: 17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8

Example, n = 8 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

m = (14.1+17.5)/2 = 15.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5.25$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$n = 23, \qquad \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation; $m = 85$

2010, 2014 baseball salaries

2010: n = 845, mean = $3,297,828, median = $1,330,000, max = $33,000,000

2014: n = 848, mean = $3,932,912, median = $1,456,250, max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: "Baseball Salaries: Mean, Median and Maximum 1985-2014". Horizontal axis: Year, 1985 to 2013. Left vertical axis: Mean, Median Salary, 200,000 to 3,700,000. Right vertical axis: Maximum Salary, 0 to 35,000,000. Series: Mean, Median, Maximum]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: frequency histogram, "2011 Baseball Salaries". Horizontal axis: Salary ($1000s). Vertical axis: Frequency, 0 to 600. Bar counts: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores". Horizontal axis: Exam Scores, 20 to 100. Vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approximately equal

[Figure: histogram, "Bank Customers 10:00-11:00 am". Horizontal axis: Number of Customers. Vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

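Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$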

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

i | x_i | x̄ | (x_i - x̄) | (x_i - x̄)²
1 | 59 | 63.4 | -4.4 | 19.0
2 | 60 | 63.4 | -3.4 | 11.3
3 | 61 | 63.4 | -2.4 | 5.6
4 | 62 | 63.4 | -1.4 | 1.8
5 | 62 | 63.4 | -1.4 | 1.8
6 | 63 | 63.4 | -0.4 | 0.1
7 | 63 | 63.4 | -0.4 | 0.1
8 | 63 | 63.4 | -0.4 | 0.1
9 | 64 | 63.4 | 0.6 | 0.4
10 | 64 | 63.4 | 0.6 | 0.4
11 | 65 | 63.4 | 1.6 | 2.7
12 | 66 | 63.4 | 2.6 | 7.0
13 | 67 | 63.4 | 3.6 | 13.3
14 | 68 | 63.4 | 4.6 | 21.6
 | Mean 63.4 | | Sum 0.0 | Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure annotation: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
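As one way of doing this in software, here is a minimal Python sketch that follows the two steps above on the 14 women's heights and then cross-checks the result against the standard library's `statistics.stdev` (which also divides by n - 1). Variable names are chosen for this illustration.

```python
import statistics

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1).
squared_devs = [(x - mean) ** 2 for x in heights]
sum_sq = sum(squared_devs)
variance = sum_sq / (n - 1)

# Step 2: standard deviation = square root of the variance.
sd = variance ** 0.5

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {variance:.2f} square inches")
print(f"standard deviation = {sd:.2f} inches")

# Cross-check with the library routine for the sample standard deviation.
print(f"statistics.stdev = {statistics.stdev(heights):.2f}")
```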

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance, σ² is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0.

When does s = 0? When does σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) = 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between ȳ - s and ȳ + s (34% on each side of ȳ)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between ȳ - 2s and ȳ + 2s (47.5% on each side of ȳ)]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342 346 347 348 348 349 354 355 355 360 361 364 367 369 371 373 377 380 381 382 385 385 387 390 390 397 398 409 409 410 418 422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

Same 50 textbook costs, with $\bar{y} = 375.48$, $s = 42.72$.

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
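The three interval counts in this example can be reproduced with a short script. This is a sketch only: the mean is recomputed from the 50 textbook costs, while the standard deviation is taken as the value stated in the example (42.72) rather than recomputed; the helper name `share_within` is made up for this illustration.

```python
# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = sum(costs) / len(costs)
s = 42.72  # assumption: standard deviation as stated in the example


def share_within(k):
    """Count and percentage of costs within k standard deviations of the mean."""
    low, high = y_bar - k * s, y_bar + k * s
    count = sum(low < c < high for c in costs)
    return count, 100 * count / len(costs)


results = {k: share_within(k) for k in (1, 2, 3)}
for k, (count, pct) in results.items():
    print(f"within {k} sd: {count}/{len(costs)} = {pct:.0f}%")
```

The observed percentages are close to, but not exactly, 68%, 95%, and 99.7%, which is what one expects for an approximately bell-shaped data set.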

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
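The SAT/ACT comparison is a direct application of the z-score formula. A minimal sketch, with a helper name (`z_score`) chosen for this illustration:

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


eleanor = z_score(680, mean=500, sd=100)  # SAT Math
gerald = z_score(27, mean=18, sd=6)       # ACT Math

print(f"Eleanor: z = {eleanor}")
print(f"Gerald:  z = {gerald}")
print("Eleanor's score is better" if eleanor > gerald else "Gerald's score is better")
```

Because z-scores are unit-free, scores from tests with different scales can be compared on the same footing.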

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School | Support | y - ybar | Z-score
Maryland | 15.5 | 6.4 | 1.79
UVA | 13.1 | 4.0 | 1.12
Louisville | 10.9 | 1.8 | 0.50
UNC | 9.2 | 0.1 | 0.03
VaTech | 7.9 | -1.2 | -0.34
FSU | 7.9 | -1.2 | -0.34
GaTech | 7.1 | -2.0 | -0.56
NCSU | 6.5 | -2.6 | -0.73
Clemson | 3.8 | -5.3 | -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

1 | 1 | 0.6
2 | 2 | 1.2
3 | 3 | 1.6
4 | 4 | 1.9
5 | 5 | 1.5
6 | 6 | 2.1
7 | 7 | 2.3
8 | 6 | 2.3
9 | 5 | 2.5
10 | 4 | 2.8
11 | 3 | 2.9
12 | 2 | 3.3
13 | 1 | 3.4
14 | 2 | 3.6
15 | 3 | 3.7
16 | 4 | 3.8
17 | 5 | 3.9
18 | 6 | 4.1
19 | 7 | 4.2
20 | 6 | 4.5
21 | 5 | 4.7
22 | 4 | 4.9
23 | 3 | 5.3
24 | 2 | 5.6
25 | 1 | 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3: 1/4 of the data in each of the 4 pieces

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
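The quartile rules above can be written out as a short function. This sketch follows the convention stated in these rules (overall median included in both halves when n is odd, in neither when n is even); other software may use different quartile conventions and can give slightly different values. The function names are chosen for this illustration.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}")
```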

Pulse Rates, n = 138

Count | Stem | Leaves
 | 4 |
3 | 4 | 588
9 | 5 | 001233444
10 | 5 | 5556788899
23 | 6 | 00011111122233333344444
23 | 6 | 55556666667777788888888
16 | 7 | 00000112222334444
23 | 7 | 55555666666777888888999
10 | 8 | 0000112224
10 | 8 | 5555667789
4 | 9 | 0012
2 | 9 | 58
4 | 10 | 0223
 | 10 |
1 | 11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

count | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

count | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

25 | 1 | 6.1
24 | 2 | 5.6
23 | 3 | 5.3
22 | 4 | 4.9
21 | 5 | 4.7
20 | 6 | 4.5
19 | 7 | 4.2
18 | 6 | 4.1
17 | 5 | 3.9
16 | 4 | 3.8
15 | 3 | 3.7
14 | 2 | 3.6
13 | 1 | 3.4
12 | 2 | 3.3
11 | 3 | 2.9
10 | 4 | 2.8
9 | 5 | 2.5
8 | 6 | 2.3
7 | 7 | 2.3
6 | 6 | 2.1
5 | 5 | 1.5
4 | 4 | 1.9
3 | 3 | 1.6
2 | 2 | 1.2
1 | 1 | 0.6

[Figure: boxplot for Disease X. Vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

25 | 1 | 7.9
24 | 2 | 6.1
23 | 3 | 5.3
22 | 4 | 4.9
21 | 5 | 4.7
20 | 6 | 4.5
19 | 7 | 4.2
18 | 6 | 4.1
17 | 5 | 3.9
16 | 4 | 3.8
15 | 3 | 3.7
14 | 2 | 3.6
13 | 1 | 3.4
12 | 2 | 3.3
11 | 3 | 2.9
10 | 4 | 2.8
9 | 5 | 2.5
8 | 6 | 2.3
7 | 7 | 2.3
6 | 6 | 2.1
5 | 5 | 1.5
4 | 4 | 1.9
3 | 3 | 1.6
2 | 2 | 1.2
1 | 1 | 0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X. Vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates, marked at 40.5, 45, 63, 70, 78, and 100.5]
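The 1.5 × IQR calculation for the pulse data can be sketched as follows. The helper name `outlier_fences` is made up for this illustration; the quartiles and the minimum and maximum (45 and 111) are the values reported for the pulse data in this section.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78.
low, high = outlier_fences(63, 78)
print(f"fences: ({low}, {high})")

# Values beyond a fence are flagged as outliers.
for pulse in (45, 111):
    flagged = pulse < low or pulse > high
    print(f"pulse {pulse}: {'outlier' if flagged else 'not an outlier'}")
```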

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers". Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (httpstatcrunchstatncsuedu) draws box plots.

[Figure: Tuition, 4-yr Colleges]

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

 | Crew | First | Second | Third | Total
Alive | 212 | 202 | 118 | 178 | 710
Dead | 673 | 123 | 167 | 528 | 1491
Total | 885 | 325 | 285 | 706 | 2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

[Figure: marginal distribution of class, bar chart]

[Figure: marginal distribution of class, pie chart]

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival | | Crew | First | Second | Third | Total
Alive | Count | 212 | 202 | 118 | 178 | 710
Alive | % of col. | 24.0% | 62.2% | 41.4% | 25.2% | 32.3%
Dead | Count | 673 | 123 | 167 | 528 | 1491
Dead | % of col. | 76.0% | 37.8% | 58.6% | 74.8% | 67.7%
Total | Count | 885 | 325 | 285 | 706 | 2201

[Figure: conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic counts: 202 first class survivors, 710 survivors, 325 first class passengers, 2201 people in total):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
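The three questions differ only in which total goes in the denominator. A minimal sketch using the Titanic counts (the nested-dictionary layout and variable names are choices made for this illustration):

```python
# Titanic counts by survival status and class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())       # everyone on board
alive_total = sum(counts["Alive"].values())                      # all survivors
first_total = counts["Alive"]["First"] + counts["Dead"]["First"]  # all first class

first_and_alive = counts["Alive"]["First"]

# Conditional on survival: fraction of survivors who were in first class.
first_given_alive = first_and_alive / alive_total
# Joint: fraction of all people who were in first class AND survived.
joint = first_and_alive / total
# Conditional on class: fraction of first class passengers who survived.
alive_given_first = first_and_alive / first_total

print(f"marginal P(alive)        = {alive_total / total:.1%}")
print(f"P(first | alive)         = {first_given_alive:.1%}")
print(f"P(first and alive)       = {joint:.1%}")
print(f"P(alive | first)         = {alive_given_first:.1%}")
```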

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student | Beers | Blood Alcohol
1 | 5 | 0.1
2 | 2 | 0.03
3 | 9 | 0.19
4 | 7 | 0.095
5 | 3 | 0.07
6 | 3 | 0.02
7 | 4 | 0.07
8 | 5 | 0.085
9 | 8 | 0.12
10 | 3 | 0.04
11 | 5 | 0.06
12 | 5 | 0.05
13 | 6 | 0.1
14 | 7 | 0.09
15 | 1 | 0.01
16 | 4 | 0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC versus number of beers for the 16 students listed in the Student/Beers/Blood Alcohol table]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

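Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$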
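The correlation for the fuel consumption example can be computed directly from the formula for r. This sketch assumes the ten data pairs are read as weight in 1000 lbs and fuel consumption in gal/100 miles, i.e. (3.4, 5.5), (3.8, 5.9), and so on, which matches the axis ranges of the scatterplot; the variable names are chosen for this illustration.

```python
import statistics

# Assumed reading of the ten (weight, fuel consumption) pairs.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

n = len(weight)
x_bar, y_bar = statistics.mean(weight), statistics.mean(fuel)
s_x, s_y = statistics.stdev(weight), statistics.stdev(fuel)  # sample sd (n - 1)

# r = 1/(n-1) * sum of products of the standardized x and y values.
r = sum(
    ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
    for x, y in zip(weight, fuel)
) / (n - 1)

print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r does not depend on the units in which weight and fuel consumption are measured.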

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot. Horizontal axis: Fouls, 0 to 30. Vertical axis: Points, 0 to 80]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Pie Charts: show proportions of the whole in each category

Example: Titanic passenger/crew distribution

Titanic Passengers by Class: Crew 40%, First 15%, Second 13%, Third 32%

Example: Top 10 causes of death in the United States

Rank | Causes of death | Counts | % of top 10s | % of total deaths
1 | Heart disease | 700,142 | 37% | 28%
2 | Cancer | 553,768 | 29% | 22%
3 | Cerebrovascular | 163,538 | 9% | 6%
4 | Chronic respiratory | 123,013 | 6% | 5%
5 | Accidents | 101,537 | 5% | 4%
6 | Diabetes mellitus | 71,372 | 4% | 3%
7 | Flu and pneumonia | 62,034 | 3% | 2%
8 | Alzheimer's disease | 53,852 | 3% | 2%
9 | Kidney disorders | 39,480 | 2% | 2%
10 | Septicemia | 32,238 | 2% | 1%
 | All other causes | 629,967 | | 25%

For each individual who died in the United States, we record the cause of death. The table above is a summary of that information.

[Figure: bar graph, "Top 10 causes of deaths in the United States". Vertical axis: Counts (x1000), 0 to 800]

Top 10 causes of death: bar graph

Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Figure: bar graph sorted by rank, "Top 10 causes of deaths in the United States". Vertical axis: Counts (x1000), 0 to 800. Easy to analyze]

[Figure: the same bar graph sorted alphabetically. Vertical axis: Counts (x1000), 0 to 800. Much less useful]

1. United States $158
2. China $64.4
3. Japan $54
4. Germany $24.4
5. Britain $23.5
6. France $19.3
7. Brazil $14.2
8. Italy $13.1
9. Australia $12.8
10. India $11.9

1. United States $137.9
2. Japan $23.4
3. Germany $20
4. Britain $16.8
5. France $12.6
6. Canada $7.3
7. Italy $6.3
8. China $5.4
9. Netherlands $5.4
10. Australia $4.8

Recent Annual Software Sales ($ billions); Recent Annual Computer Hardware Sales ($ billions)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figures: pie chart of percent of deaths from top 10 causes; pie chart of percent of deaths from all causes]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

[Figures: basic bar chart; side-by-side bar chart]

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state, 2009-10 and 2012-13. Vertical axis: 0 to 40,000. National average: 2009-10 $21,604; 2012-13 $25,043]

[Figure: "North Carolina Private Schools", horizontal bar chart by school. Axis: 0 to 60,000. Series: Tuition and fees (in-state); Average debt of graduates]

[Figure: "North Carolina Public Schools", horizontal bar chart by school. Axis: 0 to 25,000. Series: Tuition and fees (in-state); Average debt of graduates]

Student Debt: North Carolina Schools

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued): Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms
[Figure: Baker City Hospital, length of stay distribution. Frequency histogram with a vertical axis from 0 to 70 and classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18.]

Relative Frequency Histogram of Exam Grades
[Figure: relative frequency (0 to .30) plotted against grade (40 to 100).]

Histograms

A histogram shows three general types of information:

1) It provides a visual indication of where the approximate center of the data is.

2) We can gain an understanding of the degree of spread, or variation, in the data.

3) We can observe the shape of the distribution.

Histograms Showing Different Centers
[Figure: two frequency histograms on the same classes (0<2 through 16<18), vertical axis from 0 to 70.]

Histograms - Same Center, Different Spread
[Figure: two frequency histograms on the same classes (0<2 through 16<18), vertical axis from 0 to 70.]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
[Figure: symmetric distribution]

Not all distributions have a simple overall shape, especially when there are few observations.
[Figure: complex, multimodal distribution]

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
[Figure: skewed distribution]

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont.): outliers. All 200 m Races, 20.2 secs or less

[Figure: frequency histogram of times for 200 m races of 20.2 secs or less (approx. 700 races), frequency axis from 0 to 60, times from 19.2 to 20.16. Labeled points: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled.]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries, with salary bins on the horizontal axis (Bin) and frequency from 0 to 1000 on the vertical axis.]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
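
The frequency and relative frequency tables above can be reproduced with a few lines of software. The following is a minimal Python sketch, not part of the original slides; the function name `frequency_table` and its parameters are illustrative assumptions. Each class includes its lower limit and excludes its upper limit, matching "40 up to 50".

```python
from collections import Counter

# Exam grades from the slide (n = 30).
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(data, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) for each class.

    A value x falls in class k when start + k*width <= x < start + (k+1)*width.
    """
    counts = Counter((x - start) // width for x in data)
    table = []
    for k in range(n_classes):
        lower = start + k * width
        freq = counts.get(k, 0)
        table.append((lower, lower + width, freq, freq / len(data)))
    return table

table = frequency_table(grades, start=40, width=10, n_classes=6)
for lower, upper, freq, rel in table:
    print(f"{lower} up to {upper}: {freq:2d}  {rel:.3f}")
```

The relative frequencies are the heights of the bars in the relative frequency histogram of the grades.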

Relative Frequency Histogram of Grades
[Figure: relative frequency (0 to .30) plotted against grade (40 to 100).]

Based on the histogram, about what percent of the values are between 475 and 525?

1) 50

2) 5

3) 17

4) 30

Stem and leaf displays have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Suppose a 95 yr old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5
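
The construction above (10's digit as stem, 1's digit as leaf, empty stems kept so that gaps stay visible) can be sketched in Python. This is an illustrative sketch only; the function name `stem_and_leaf` is an assumption, and it handles only non-negative integers where the stem is the value divided by 10.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows: stem = value // 10, leaf = value % 10."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest, including empty ones,
    # so that a gap (such as stems 7 and 8 below) remains visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

# Employee ages from the slide, plus the newly hired 95 year old.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages + [95])))
```

Because the leaves are sorted within each stem, the display also shows the data in ascending order.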

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates, n = 138

Count  Stem | Leaves
         4 |
    3    4 | 588
    9    5 | 001233444
   10    5 | 5556788899
   23    6 | 00011111122233333344444
   23    6 | 55556666667777788888888
   16    7 | 00000112222334444
   23    7 | 55555666666777888888999
   10    8 | 0000112224
   10    8 | 5555667789
    4    9 | 0012
    2    9 | 58
    4   10 | 0223
        10 |
    1   11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement is displayed

2) ascending order in each stem row

3) relatively simple (when the data set is not too large)

Disadvantages

the display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1) 4

2) 6

3) 8

4) 10

5) 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
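
As a quick check of the arithmetic above, the same calculation can be sketched in Python. The function name `sample_mean` is an illustrative assumption, not notation from the slides.

```python
def sample_mean(values):
    """Sample mean: the sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

# Weekly TV viewing time (hours) of the 7 fourth graders.
tv_hours = [19, 40, 16, 12, 10, 6, 9]

print(sum(tv_hours))          # sum of the y_i
print(sample_mean(tv_hours))  # y-bar
```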

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: frequency histogram of absences from work, frequency axis from 0 to 70, bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More.]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = the middle value, if n is odd
       = the mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
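
The two-case rule above translates directly into code. This is a minimal Python sketch; the function name `median` is an illustrative choice (Python's standard library also provides `statistics.median`, which follows the same rule).

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # the rule assumes the data are in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))  # n = 5 (odd)
print(median([2, 4, 6, 8]))      # n = 4 (even)
```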

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245

2) 4965.5

3) 4960

4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245

2) 4965.5

3) 5546

4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$; $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th obs.; $m = 85$

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
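
To illustrate this sensitivity, the following Python sketch reuses the weekly TV viewing data from the earlier sample mean example (this pairing is an illustration, not from the slides) and drops the single largest observation. It uses `mean` and `median` from Python's standard `statistics` module.

```python
from statistics import mean, median

# Weekly TV viewing time (hours); 40 is much larger than the other values.
tv_hours = [19, 40, 16, 12, 10, 6, 9]

# The same data with the single largest observation removed.
without_largest = sorted(tv_hours)[:-1]

print("all 7 values:   mean =", mean(tv_hours), " median =", median(tv_hours))
print("largest removed: mean =", mean(without_largest),
      " median =", median(without_largest))
```

Removing one observation moves the mean by 4 hours but the median by only 1 hour, which is why the median is often reported for skewed data such as salaries.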

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Time plot by year; one axis shows mean and median salary (200000 to 3700000), the other shows maximum salary (0 to 35000000).]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: frequency histogram of 2011 baseball salaries ($1000s), frequency axis from 0 to 600.]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: histogram of exam scores, frequency axis from 0 to 30, exam scores from 20 to 100.]

Symmetric data

mean and median approximately equal

[Figure: Bank Customers, 10:00-11:00 am. Histogram of frequency (0 to 20) against number of customers.]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: the 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women height (inches)

 i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59    63.4       -4.4             19.0
 2    60    63.4       -3.4             11.3
 3    61    63.4       -2.4              5.6
 4    62    63.4       -1.4              1.8
 5    62    63.4       -1.4              1.8
 6    63    63.4       -0.4              0.1
 7    63    63.4       -0.4              0.1
 8    63    63.4       -0.4              0.1
 9    64    63.4        0.6              0.4
10    64    63.4        0.6              0.4
11    65    63.4        1.6              2.7
12    66    63.4        2.6              7.0
13    67    63.4        3.6             13.3
14    68    63.4        4.6             21.6
      Mean 63.4      Sum 0.0          Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
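
As one example of "other software", here is a short Python sketch that repeats the women's height calculation step by step and then compares it with `stdev` from Python's standard `statistics` module, which also divides by n - 1. The variable names are illustrative.

```python
from statistics import mean, stdev

# Women's heights (inches) from the worked table, n = 14.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

y_bar = mean(heights)                             # sample mean
sum_sq = sum((y - y_bar) ** 2 for y in heights)   # sum of squared deviations
df = len(heights) - 1                             # degrees of freedom, n - 1
s2 = sum_sq / df                                  # sample variance (square inches)
s = s2 ** 0.5                                     # sample standard deviation (inches)

print(f"mean = {y_bar:.1f}, sum of squares = {sum_sq:.1f}")
print(f"variance = {s2:.2f}, standard deviation = {s:.2f}")
print(f"statistics.stdev gives {stdev(heights):.2f}")
```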

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N$ = 34000 for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION

$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

The 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical).

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \ \bar{y} + 3s)$

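68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s (34% on each side of y-bar).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s (47.5% on each side of y-bar).]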

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
$(\bar{y} - s, \ \bar{y} + s) = (332.76, \ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \ \bar{y} + 2s) = (290.04, \ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \ \bar{y} + 3s) = (247.32, \ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
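
The interval counts for the textbook cost data can be checked with a short Python sketch. The helper name `share_within` is an illustrative assumption; `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev

# Textbook costs from the slide, n = 50.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)   # sample standard deviation (divides by n - 1)

def share_within(data, center, spread, k):
    """Fraction of the data inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(low < x < high for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} s.d.: {share_within(costs, y_bar, s, k):.0%}")
```

The observed shares are close to, but not exactly, the 68%, 95% and 99.7% that the rule gives for bell-shaped data.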

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10

2) 15

3) 20

4) 40


Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680
SAT mean = 500, sd = 100

ACT Math: Gerald's score 27
ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8
Gerald's z-score: z = (27-18)/6 = 1.5
Eleanor's score is better.
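
The standardization used in these comparisons is a one-line calculation. A minimal Python sketch follows; the function name `z_score` and its parameter names are illustrative assumptions.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam comparison: same mean, different standard deviations.
print(z_score(91, mean=88, sd=6))     # exam 1
print(z_score(92, mean=88, sd=10))    # exam 2

# SAT vs ACT comparison: different scales.
print(z_score(680, mean=500, sd=100)) # Eleanor, SAT Math
print(z_score(27, mean=18, sd=6))     # Gerald, ACT Math
```

Because z-scores have no units, scores from tests with different scales can be compared directly.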

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03

2) -1.03

3) 2.39

4) 1865

5) -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Rank   Position in group   Value
  1           1             0.6
  2           2             1.2
  3           3             1.6
  4           4             1.9
  5           5             1.5
  6           6             2.1
  7           7             2.3
  8           6             2.3
  9           5             2.5
 10           4             2.8
 11           3             2.9
 12           2             3.3
 13           1             3.4
 14           2             3.6
 15           3             3.7
 16           4             3.8
 17           5             3.9
 18           6             4.1
 19           7             4.2
 20           6             4.5
 21           5             4.7
 22           4             4.9
 23           3             5.3
 24           2             5.6
 25           1             6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
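
The rule above (include the overall median in both halves when n is odd, in neither half when n is even) can be sketched in Python. The function names are illustrative assumptions. Note that statistical software often uses slightly different quartile conventions, so built-in functions may not match this rule exactly.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the halves split cleanly; no value is shared.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```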

Pulse Rates, n = 138

Count  Stem | Leaves
         4 |
    3    4 | 588
    9    5 | 001233444
   10    5 | 5556788899
   23    6 | 00011111122233333344444
   23    6 | 55556666667777788888888
   16    7 | 00000112222334444
   23    7 | 55555666666777888888999
   10    8 | 0000112224
   10    8 | 5555667789
    4    9 | 0012
    2    9 | 58
    4   10 | 0223
        10 |
    1   11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
  (4)   28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
  (4)   28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

[Figure: the 25 sorted Disease X observations, listed from largest to smallest, with largest = max = 6.1, Q3 = third quartile = 4.2, m = median = 3.4, Q1 = first quartile = 2.3, smallest = min = 0.6.]

[Figure: boxplot for Disease X, years until death on an axis from 0 to 7.]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary

[Figure: the same 25 sorted Disease X observations with the largest value changed to 7.9 (largest = max = 7.9), Q3 = third quartile = 4.2, Q1 = first quartile = 2.3, shown next to a boxplot of years until death on an axis from 0 to 8.]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

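Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses with median 70, Q1 = 63, Q3 = 78, fences at 40.5 and 100.5, and minimum 45 marked.]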
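
The 1.5 x IQR calculation used for boxplot outliers can be sketched in Python as follows. The function name `outlier_fences` is an illustrative assumption; "fence" is a common name for the two cutoffs Q1 - 1.5(IQR) and Q3 + 1.5(IQR).

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78.
print(outlier_fences(63, 78))

# Disease X data: Q1 = 2.3, Q3 = 4.2; the largest observation is 7.9 years.
lower, upper = outlier_fences(2.3, 4.2)
print(round(upper, 2), 7.9 > upper)   # 7.9 lies above the upper fence
```

Observations outside the fences are plotted individually as outliers; the boxplot's lines extend only to the most extreme observations that are inside the fences.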

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of pass catching yards by receivers, axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive   710/2201   32.3%
Dead   1491/2201   67.7%

marg. dist. of class:
Crew     885/2201   40.2%
First    325/2201   14.8%
Second   285/2201   12.9%
Third    706/2201   32.1%

Marginal distribution of class: bar chart

Marginal distribution of class: pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions
Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col     24.0     62.2     41.4    25.2    32.3
Dead    Count         673      123      167     528    1491
        % of col     76.0     37.8     58.6    74.8    67.7
Total   Count         885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
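
The marginal and conditional distributions for the Titanic table can be computed with a short Python sketch. The nested-dictionary layout and variable names are illustrative assumptions; only the counts come from the slides.

```python
# Counts from the Titanic contingency table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {status: sum(row.values()) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total.
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
class_marginal = {c: class_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: count / column total.
alive_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

for c in classes:
    print(f"{c:<7} share of people {class_marginal[c]:.1%}, "
          f"survived given class {alive_given_class[c]:.1%}")
```

The three questions on the slide differ only in the denominator: the row total (710), the grand total (2201), or the column total (325).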

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 80

2) 235

3) 582

4) 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 418

2) 388

3) 512

4) 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 452

2) 488

3) 268

4) 277


Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Scatterplot of fuel consumption (gal/100 miles, 2 to 7) against weight (1000 lbs, 1.5 to 4.5).]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

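Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Scatterplot of fuel consumption (gal/100 miles, 2 to 7) against weight (1000 lbs, 1.5 to 4.5).]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$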
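
The formula for r can be applied to the fuel consumption data with a short Python sketch. The function name `correlation` is an illustrative assumption, and the data values are the ten (weight, fuel consumption) pairs read with one decimal place, which is consistent with the axis ranges of the scatterplot; `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

print(round(correlation(weight, fuel), 4))
```

Each term in the sum is a product of two z-scores, so r is positive when above-average x values tend to go with above-average y values.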

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points (0 to 80) against fouls (0 to 30).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Example: Top 10 causes of death in the United States

Rank  Cause of death         Count     % of top 10s   % of total deaths
1     Heart disease          700,142   37             28
2     Cancer                 553,768   29             22
3     Cerebrovascular        163,538   9              6
4     Chronic respiratory    123,013   6              5
5     Accidents              101,537   5              4
6     Diabetes mellitus      71,372    4              3
7     Flu and pneumonia      62,034    3              2
8     Alzheimer's disease    53,852    3              2
9     Kidney disorders       39,480    2              2
10    Septicemia             32,238    2              1
      All other causes       629,967                  25

For each individual who died in the United States, we record the cause of death. The table above is a summary of that information.

[Bar graph: Top 10 causes of death in the United States; vertical axis: counts (x1000), 0 to 800]

Top 10 causes of death: bar graph. Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Bar graph sorted by rank: easy to analyze]

[Bar graph sorted alphabetically: much less useful]

1. United States $158
2. China $64.4
3. Japan $54
4. Germany $24.4
5. Britain $23.5
6. France $19.3
7. Brazil $14.2
8. Italy $13.1
9. Australia $12.8
10. India $11.9

1. United States $137.9
2. Japan $23.4
3. Germany $20
4. Britain $16.8
5. France $12.6
6. Canada $7.3
7. Italy $6.3
8. China $5.4
9. Netherlands $5.4
10. Australia $4.8

Recent Annual Software Sales ($ billions); Recent Annual Computer Hardware Sales ($ billions)

NY Times

[Pie chart: Percent of people dying from top 10 causes of death in the United States]

Top 10 causes of death: pie chart. Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Pie charts: percent of deaths from top 10 causes; percent of deaths from all causes]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

[Figures: basic bar chart; side-by-side bar chart]

Trend: Student Debt by State (grads of public 4 yr or more)

[Side-by-side bar chart, 2009-10 vs 2012-13; vertical axis: 0 to 40,000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National Average: 2009-10 $21,604; 2012-13 $25,043

Student Debt: North Carolina Schools

[Bar chart: North Carolina Private Schools; series: tuition and fees (in-state), average debt of graduates; horizontal axis: 0 to 60,000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Bar chart: North Carolina Public Schools; series: tuition and fees (in-state), average debt of graduates; horizontal axis: 0 to 25,000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Frequency histogram: Baker City Hospital - Length of Stay Distribution; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

[Relative Frequency Histogram of Exam Grades; horizontal axis: grade, 40 to 100; vertical axis: relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information:

It provides a visual indication of where the approximate center of the data is.

We can gain an understanding of the degree of spread, or variation, in the data.

We can observe the shape of the distribution.

Histograms Showing Different Centers

[Two frequency histograms; classes 0<2, 2<4, ..., 16<18; vertical axis: 0 to 70]

Histograms - Same Center, Different Spread

[Two frequency histograms; classes 0<2, 2<4, ..., 16<18; vertical axis: 0 to 70]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

[Histograms: Age (left-skewed); Cost (right-skewed)]

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Histogram: 200 m Races 20.2 secs or less (approx 700); horizontal axis: times, 19.2 to 20.16; vertical axis: frequency, 0 to 60. Labeled points: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Histogram with Alaska and Florida marked]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Excel histogram; horizontal axis: bin (salary); vertical axis: frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
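The two grade tables can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `frequency_table` is a hypothetical helper, and each class is taken to include its lower limit and exclude its upper limit ("40 up to 50" means 40 <= grade < 50).

```python
# Exam grades from the slide (n = 30).
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]


def frequency_table(data, start, stop, width):
    """Return rows (lower, upper, frequency, relative frequency).

    Each class is the half-open interval [lower, upper).
    """
    n = len(data)
    rows = []
    for lower in range(start, stop, width):
        upper = lower + width
        freq = sum(1 for value in data if lower <= value < upper)
        rows.append((lower, upper, freq, freq / n))
    return rows


table = frequency_table(grades, 40, 100, 10)
for lower, upper, freq, rel in table:
    print(f"{lower} up to {upper}: frequency {freq}, relative frequency {rel:.3f}")
```

The relative frequencies are the heights of the bars in a relative frequency histogram, so they should sum to 1.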

Relative Frequency Histogram of Grades

[Relative frequency histogram; horizontal axis: grade, 40 to 100; vertical axis: relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50
2. 5
3. 17
4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit. For example, 18 has stem = 1 and leaf = 8, so 18 = 1 | 8.

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
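A stem-and-leaf display like the one for the employee ages can be built programmatically. The sketch below is illustrative only (the function name `stem_and_leaf` is an assumption, not from the slides); it uses the 10's digit as the stem and the 1's digit as the leaf, and it prints empty stems so that gaps in the data stay visible.

```python
from collections import defaultdict

# Employee ages from the slide.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]


def stem_and_leaf(values):
    """Return display lines with stem = 10's digit and leaf = 1's digit."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    lines = []
    # Walk every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


display = stem_and_leaf(ages)
print("\n".join(display))
```

Calling `stem_and_leaf(ages + [95])` adds empty rows for stems 7 and 8 before the row `9 | 5`, which is how an unusually large value shows up as a gap in the display.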

Suppose a 95 yr old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates, n = 138

Count  Stem | Leaves
        4  |
  3     4  | 588
  9     5  | 001233444
 10     5  | 5556788899
 23     6  | 00011111122233333344444
 23     6  | 55556666667777788888888
 16     7  | 00000112222334444
 23     7  | 55555666666777888888999
 10     8  | 0000112224
 10     8  | 5555667789
  4     9  | 0012
  2     9  | 58
  4    10  | 0223
       10  |
  1    11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement is displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order; time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

[Figure: Unemployment Rate by Educational Attainment]

[Figure: Water Use During Super Bowl XLV (Packers 31, Steelers 25)]

[Figure: Heat Maps]

[Figure: Word Wall (customer feedback)]

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N = 34000$ for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Histogram: Absences from Work; horizontal axis: bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
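The median rule (middle value for odd n, mean of the two middle values for even n) is short enough to write out directly. The following is a minimal sketch applied to the 62 student pulse rates; the function name `median` is chosen here for illustration, and Python's built-in `statistics.median` follows the same rule.

```python
# Student pulse rates from the slide (n = 62).
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]


def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


pulse_median = median(pulses)
print(len(pulses), pulse_median)
```

With n = 62 the two middle values sit in ordered positions 31 and 32, which are 75 and 76.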

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Line chart: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis: year; left axis: mean and median salary, 200,000 to 3,700,000; right axis: maximum salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries; horizontal axis: salary ($1000s); vertical axis: frequency, 0 to 600]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores; horizontal axis: exam scores, 20 to 100; vertical axis: frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Histogram: Bank Customers 10:00-11:00 am; horizontal axis: number of customers; vertical axis: frequency, 0 to 20]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest. OK sometimes; in general, too crude; sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

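Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$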

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

i     x_i   x̄      (x_i - x̄)   (x_i - x̄)^2
1     59    63.4   -4.4         19.0
2     60    63.4   -3.4         11.3
3     61    63.4   -2.4         5.6
4     62    63.4   -1.4         1.8
5     62    63.4   -1.4         1.8
6     63    63.4   -0.4         0.1
7     63    63.4   -0.4         0.1
8     63    63.4   -0.4         0.1
9     64    63.4   0.6          0.4
10    64    63.4   0.6          0.4
11    65    63.4   1.6          2.7
12    66    63.4   2.6          7.0
13    67    63.4   3.6          13.3
14    68    63.4   4.6          21.6
            Mean 63.4   Sum 0.0   Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
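As one way of getting the standard deviation from software, the women's height calculation can be sketched in Python. The helper names `sample_variance` and `sample_std` are hypothetical; the standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

# Women's heights in inches from the slide (n = 14).
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((value - mean) ** 2 for value in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


variance = sample_variance(heights)
std_dev = sample_std(heights)
print(f"variance = {variance:.2f} square inches, s = {std_dev:.2f} inches")
```

The result agrees with the slide's hand calculation (variance about 6.55 square inches, s about 2.56 inches) up to rounding.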

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION

$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) = 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

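68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Bell curve: 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Bell curve: 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]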

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
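The comparison between the textbook cost data and the 68-95-99.7 rule can be checked by counting. This is a sketch under the assumption that the 50 values listed on the slide are the complete data set; `count_within` is a hypothetical helper name.

```python
import statistics

# Textbook costs from the slide (n = 50).
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (divides by n - 1)


def count_within(values, center, spread, k):
    """Count the values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for value in values if low < value < high)


for k, rule in [(1, 68), (2, 95), (3, 99.7)]:
    count = count_within(costs, y_bar, s, k)
    share = 100 * count / len(costs)
    print(f"within {k} s.d.: {count}/50 = {share:.0f}% (rule: {rule}%)")
```

The observed shares are close to, but not exactly equal to, the rule's percentages, which is expected because the rule is only an approximation for roughly bell-shaped data.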

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
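The SAT/ACT comparison is a direct application of the z-score formula. A minimal sketch (the function name `z_score` is chosen here for illustration):

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


eleanor = z_score(680, mean=500, sd=100)  # SAT Math
gerald = z_score(27, mean=18, sd=6)       # ACT Math

print(f"Eleanor z = {eleanor:.1f}, Gerald z = {gerald:.1f}")
print("Eleanor's score is better" if eleanor > gerald else "Gerald's score is better")
```

Because z-scores are unit-free, they allow scores from tests with different scales to be compared directly.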

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5      6.4        1.79
UVA          13.1      4.0        1.12
Louisville   10.9      1.8        0.50
UNC          9.2       0.1        0.03
VaTech       7.9       -1.2       -0.34
FSU          7.9       -1.2       -0.34
GaTech       7.1       -2.0       -0.56
NCSU         6.5       -2.6       -0.73
Clemson      3.8       -5.3       -1.47
                       Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

1    1   0.6
2    2   1.2
3    3   1.6
4    4   1.9
5    5   1.5
6    6   2.1
7    7   2.3
8    6   2.3
9    5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Diagram: Q1, M, and Q3 split the data into four pieces, each holding 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
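The quartile rules above can be sketched in Python as follows. Software packages use several different quartile conventions, so library functions such as `statistics.quantiles` may give slightly different values; this sketch implements only the rule stated on the slide (include the overall median in both halves when n is odd, in neither half when n is even). The function names are hypothetical.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the slide's example with n = 10 this gives Q1 = 6, m = 11, and Q3 = 16.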

Pulse Rates, n = 138

Count  Stem | Leaves
        4  |
  3     4  | 588
  9     5  | 001233444
 10     5  | 5556788899
 23     6  | 00011111122233333344444
 23     6  | 55556666667777788888888
 16     7  | 00000112222334444
 23     7  | 55555666666777888888999
 10     8  | 0000112224
 10     8  | 5555667789
  4     9  | 0012
  2     9  | 58
  4    10  | 0223
       10  |
  1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Example: Disease X, years until death

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

[Boxplot: Disease X; vertical axis: years until death, 0 to 7; five-number summary: min, Q1, m, Q3, max]

Boxplot: display of 5-number summary

[Boxplot]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary (with an outlier)

Disease X, years until death, sorted data with the largest value changed:

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

[Boxplot: Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot with labeled values: 40.5, 45, 63, 70, 78, 100.5]
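The 1.5 × IQR calculation used to flag outliers in a boxplot can be sketched as below. The function names `fences` and `find_outliers` are hypothetical, and the quartiles are passed in as given on the slides rather than recomputed.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Values that fall outside the fences are flagged as outliers."""
    low, high = fences(q1, q3)
    return [value for value in values if value < low or value > high]


# Beginning-of-class pulses: Q1 = 63, Q3 = 78.
pulse_low, pulse_high = fences(63, 78)
print(pulse_low, pulse_high)

# Disease X data (years until death) with the largest value equal to 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
flagged = find_outliers(years, q1=2.3, q3=4.2)
upper_fence = fences(2.3, 4.2)[1]
whisker_top = max(value for value in years if value <= upper_fence)
print(flagged, whisker_top)
```

The upper whisker stops at the largest observation that is not beyond the upper fence, and any flagged values are plotted individually.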

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178     710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival              Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col.     24.0    62.2    41.4     25.2    32.3
Dead    Count         673     123     167      528     1491
        % of col.     76.0    37.8    58.6     74.8    67.7
Total   Count         885     325     285      706     2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                          Class
Survival              Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col.     24.0    62.2    41.4     25.2    32.3
Dead    Count         673     123     167      528     1491
        % of col.     76.0    37.8    58.6     74.8    67.7
Total   Count         885     325     285      706     2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
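The three Titanic questions differ only in which total is used as the denominator. The sketch below stores the counts from the table in a nested dictionary (a layout chosen here for illustration) and computes each fraction.

```python
# Counts from the Titanic contingency table: survival (rows) by class (columns).
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in titanic.values())
survivors = sum(titanic["Alive"].values())                 # row total
first_class = sum(row["First"] for row in titanic.values())  # column total
alive_first = titanic["Alive"]["First"]

# Conditional on surviving: fraction of survivors who were in first class.
survivors_in_first = alive_first / survivors
# Joint: fraction of everyone on board who were in first class AND survived.
first_and_survived = alive_first / grand_total
# Conditional on class: fraction of first class passengers who survived.
first_who_survived = alive_first / first_class

print(f"{survivors_in_first:.1%} {first_and_survived:.1%} {first_who_survived:.1%}")
```

The last value matches the 62.2% in the "% of col." row of the table, since a column percentage is a conditional distribution given the class.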

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student  Beers  Blood Alcohol
    1      5       0.10
    2      2       0.03
    3      9       0.19
    4      7       0.095
    5      3       0.07
    6      3       0.02
    7      4       0.07
    8      5       0.085
    9      8       0.12
   10      3       0.04
   11      5       0.06
   12      5       0.05
   13      6       0.10
   14      7       0.09
   15      1       0.01
   16      4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2.0, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

For bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

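Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$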
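The formula for r can be checked against the fuel consumption example. This is a minimal sketch: the function name `correlation` is hypothetical, and the ten data points are the slide's points read with decimals restored (weights in 1000 lbs, fuel consumption in gal/100 miles), which is an assumption about the garbled source.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r: the sum of products of standardized
    x and y values, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1), as in the formula for r.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs (assumed decimals)
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles (assumed decimals)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

With these readings the result agrees with the value of r reported on the slide. Because each term is a product of two z-scores, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.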

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


[Figure: bar graph, Top 10 causes of deaths in the United States; vertical axis: Counts (x1000), 0 to 800]

Top 10 causes of death: bar graph

Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

The number of individuals who died of an accident is approximately 100,000.

[Figure: bar graph, Top 10 causes of deaths in the United States, bars sorted by rank; vertical axis: Counts (x1000), 0 to 800]

Bar graph sorted by rank: easy to analyze.

[Figure: the same bar graph with bars sorted alphabetically; vertical axis: Counts (x1000), 0 to 800]

Sorted alphabetically: much less useful.

1 United States $1582 China $6443 Japan $544 Germany $2445 Britain $2356 France $1937 Brazil $1428 Italy $1319 Australia $12810 India $119

1 United States $13792 Japan $2343 Germany $204 Britain $1685 France $1266 Canada $737 Italy $638 China $54 9 Netherlands $5410 Australia $48

Recent Annual Software Sales ($billions)Recent Annual Computer Hardware Sales ($billion)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

Percent of deaths from top 10 causes

Percent of deaths from all causes

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of student debt by state, 2009-10 and 2012-13; vertical axis: 0 to 40,000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National Average: 2009-10 $21,604; 2012-13 $25,043

Student Debt: North Carolina Schools

[Figure: horizontal bar chart, North Carolina Private Schools; bars: Tuition and fees (in-state), Average debt of graduates; axis: 0 to 60,000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Figure: horizontal bar chart, North Carolina Public Schools; bars: Tuition and fees (in-state), Average debt of graduates; axis: 0 to 25,000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued) Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram, BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; horizontal axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information:

It provides a visual indication of where the approximate center of the data is.

We can gain an understanding of the degree of spread, or variation, in the data.

We can observe the shape of the distribution.

Histograms Showing Different Centers

[Figure: two frequency histograms with different centers; horizontal axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms with the same center and different spread; horizontal axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers
All 200 m Races 20.2 secs or less

[Figure: histogram, 200 m Races 20.2 secs or less (approx 700); horizontal axis: TIMES, 19.2 to 20.16 in steps of 0.06; vertical axis: Frequency, 0 to 60. Labeled values: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis: Bin; vertical axis: Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
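The frequency and relative frequency columns can be tallied directly from the 30 exam grades. A minimal sketch, assuming classes of width 10 that include their lower limit and exclude their upper limit ("40 up to 50" means 40 <= grade < 50); the variable names are illustrative.

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

n = len(grades)

# Count the grades falling in each class [lower, lower + 10).
freq = {lower: sum(lower <= g < lower + 10 for g in grades)
        for lower in range(40, 100, 10)}

for lower, f in freq.items():
    # Relative frequency = class frequency / total number of observations.
    print(f"{lower} up to {lower + 10}: frequency {f}, relative frequency {f / n:.3f}")
```

The relative frequencies add to 1, which is why a relative frequency histogram has the same shape as the frequency histogram with only the vertical scale changed.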

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%

2. 5%

3. 17%

4. 30%

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
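Building the display is mechanical: sort the data, split each value into its tens digit (stem) and ones digit (leaf), and list the leaves beside their stem. The sketch below is an illustration only; the function name `stem_and_leaf` is hypothetical and it assumes non-negative whole-number data with stems equal to the 10's digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf display (stem = tens, leaf = ones)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has
    # no leaves, so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)
print("\n".join(display))
```

Keeping the empty stems matters when an unusual value is added: calling `stem_and_leaf(ages + [95])` produces empty rows for stems 7 and 8 before the row `9 | 5`, which makes the gap obvious.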

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem | Leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; horizontal axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458

m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  4960  4971  5245  5546  7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  5245  5546  4971  5587  7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$
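The second property is easy to see numerically: changing one extreme value moves the mean but can leave the median untouched. The sketch below uses Python's standard `statistics` module; the two small data sets are the ones in the example above, and the class pulse rates (including the unusually high value 140) are taken from the example on the next slide.

```python
import statistics

a = [2, 4, 6, 8]
b = [2, 4, 6, 9]  # same data with the largest value increased by 1

# The mean changes (5 vs 5.25); the median is 5 for both data sets.
print(statistics.mean(a), statistics.median(a))
print(statistics.mean(b), statistics.median(b))

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Compare the full data set with the data set after dropping the value 140.
without_140 = pulses[:-1]
print(round(statistics.mean(pulses), 2), statistics.median(pulses))
print(round(statistics.mean(without_140), 2), statistics.median(without_140))
```

Dropping the single value 140 lowers the mean by about 2.5 beats per minute, while the median moves only from 85 to 84.5.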

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart, Baseball Salaries: Mean, Median and Maximum 1985-2014; horizontal axis: Year, 1985 to 2013; left vertical axis: Mean, Median Salary, 200,000 to 3,700,000; right vertical axis: Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar counts: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

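Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$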

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$

$s^2$ is called the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women height (inches)

  i    x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
  1    59    63.4      -4.4            19.0
  2    60    63.4      -3.4            11.3
  3    61    63.4      -2.4             5.6
  4    62    63.4      -1.4             1.8
  5    62    63.4      -1.4             1.8
  6    63    63.4      -0.4             0.1
  7    63    63.4      -0.4             0.1
  8    63    63.4      -0.4             0.1
  9    64    63.4       0.6             0.4
 10    64    63.4       0.6             0.4
 11    65    63.4       1.6             2.7
 12    66    63.4       2.6             7.0
 13    67    63.4       3.6            13.3
 14    68    63.4       4.6            21.6

Mean 63.4          Sum 0.0          Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
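As one example of "other software", the women's heights calculation can be reproduced in a few lines of Python. This is a sketch for illustration: it walks through the same steps as the table (deviations, squared deviations, divide by n - 1, square root) and then compares with the `statistics.stdev` function from Python's standard library, which also uses the n - 1 divisor.

```python
import math
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean = sum(heights) / n

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [x - mean for x in heights]

# Sum of squared deviations, then divide by the degrees of freedom n - 1.
sum_sq = sum(d ** 2 for d in deviations)
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(f"mean = {mean:.1f}, sum of squares = {sum_sq:.1f}")
print(f"variance = {variance:.2f} square inches, sd = {sd:.2f} inches")

# The library function gives the same standard deviation.
print(f"statistics.stdev = {statistics.stdev(heights):.2f}")
```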

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU); $\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$         sample median
$s^2$       sample variance
$s$         sample stand. dev.

POPULATION
$\mu$       population mean
            population median
$\sigma^2$  population variance
$\sigma$    population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) and Histogram (graphical) are combined in the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then:

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

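68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-s and y+s, 34% on each side of the mean y; vertical axis: 0 to 0.45]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-2s and y+2s, 47.5% on each side of the mean y; vertical axis: 0 to 0.45]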

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
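The three interval counts for the textbook costs can be verified by brute force. The sketch below is illustrative only; the helper name `count_within` is hypothetical, and the mean and standard deviation are recomputed from the 50 data values rather than typed in.

```python
import statistics

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (divisor n - 1)


def count_within(k):
    """Number of data values inside (y_bar - k*s, y_bar + k*s)."""
    return sum(y_bar - k * s < c < y_bar + k * s for c in costs)


for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    pct = 100 * count_within(k) / len(costs)
    print(f"within {k} sd: {count_within(k)}/{len(costs)} = {pct:.0f}% (rule: {rule})")
```

The observed percentages are close to, but not exactly, the 68-95-99.7 values, which is expected: the rule is an approximation for roughly bell-shaped data, not an exact law.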

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8
Gerald's z-score: z = (27-18)/6 = 1.5
Eleanor's score is better.
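Standardizing puts scores from different scales on a common footing, which is all the SAT/ACT and exam comparisons require. A minimal sketch; the function name `z_score` is chosen here for illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# SAT vs ACT: different scales, so compare z-scores rather than raw scores.
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)
print(f"Eleanor z = {eleanor:.1f}, Gerald z = {gerald:.1f}")

# Two exams with the same mean but different spreads.
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)
print(f"exam 1 z = {exam1:.1f}, exam 2 z = {exam2:.1f}")
```

The higher raw score (92) has the lower z-score because exam 2 has more spread: being 4 points above the mean is less unusual when the standard deviation is 10 than being 3 points above when it is 6.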

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4        1.79
UVA           13.1       4.0        1.12
Louisville    10.9       1.8        0.50
UNC            9.2       0.1        0.03
VaTech         7.9      -1.2       -0.34
FSU            7.9      -1.2       -0.34
GaTech         7.1      -2.0       -0.56
NCSU           6.5      -2.6       -0.73
Clemson        3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697          Sum = 0    Sum = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

[Figure: 25 observations listed in order: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1]

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Quartiles and median divide data into 4 pieces

[Figure: ordered data split at Q1, M, and Q3, with 1/4 of the data in each of the 4 pieces]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
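The quartile rules above translate directly into code. This sketch follows the slides' convention (include the overall median in both halves when n is odd, in neither when n is even); the function name `five_number_summary` is hypothetical. Note that calculators and software packages may use slightly different quartile conventions, so their answers can differ a little from this rule.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the halves split cleanly and exclude nothing.
        lower, upper = data[:half], data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
minimum, q1, median, q3, maximum = five_number_summary(example)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data
print(minimum, q1, median, q3, maximum, iqr)
```

For the ten-value example this reproduces the hand calculation: Q1 = 6, median = 11, Q3 = 16.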

Pulse Rates, n = 138

Count  Stem | Leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
 (4)    28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
 (4)    28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data listed from individual 25 down to individual 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data listed from individual 25 down to individual 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: BOXPLOT for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

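Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with the values 40.5, 45, 63, 70, 78 and 100.5 marked]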
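The 1.5 × IQR arithmetic for the pulse data can be checked with a short Python sketch. The helper name `iqr_fences` is hypothetical, and the five-number summary is the one quoted earlier for the pulse data (45, 63, 70, 78, 111); only the minimum and maximum are tested here because the full data set is not needed to see whether any value falls outside the fences.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summary of the pulse data: min, Q1, median, Q3, max
five_number = (45, 63, 70, 78, 111)
low, high = iqr_fences(five_number[1], five_number[3])

# A value is flagged as an outlier when it lies outside the fences
extremes = (five_number[0], five_number[-1])
flagged = [v for v in extremes if v < low or v > high]

print("fences:", low, high)
print("flagged extremes:", flagged)
```

With Q1 = 63 and Q3 = 78 the fences are 40.5 and 100.5, so the minimum of 45 is inside the fences while the maximum of 111 is flagged.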

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                      Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                      Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
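The marginal and conditional distributions for the Titanic table, and the three fractions above, can be reproduced with a short Python sketch. The dictionary layout and variable names are illustrative choices, not part of the slides; the counts are the ones in the table.

```python
# Titanic counts by survival (rows) and class (columns), from the table
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {s: sum(row.values()) for s, row in counts.items()}
col_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions about first class
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
first_and_survivor = counts["Alive"]["First"] / grand_total             # 202/2201
survived_among_first = counts["Alive"]["First"] / col_totals["First"]   # 202/325

for c in classes:
    print(f"{c:7s} marginal {marginal_class[c]:.1%}  survived {survival_given_class[c]:.1%}")
```

The three fractions differ only in the denominator: the survivors' total, the grand total, or the first-class total.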

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

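Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$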
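As a check on r = .9766, the formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. This is a minimal sketch using only the Python standard library; the function name `correlation` is an illustrative choice, and the data values assume the decimal points shown for the car data (weights in 1000 lbs, fuel consumption in gal/100 miles).

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Swapping the roles of x and y gives the same value, since the formula treats the two variables symmetrically.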

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points vs fouls; horizontal axis: Fouls, 0 to 30; vertical axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


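Top 10 causes of deaths in the United States

[Figure: bar graph of the top 10 causes of death, sorted by rank; vertical axis: Counts (x1000), 0 to 800]

Bar graph sorted by rank: easy to analyze.

[Figure: the same bar graph sorted alphabetically; vertical axis: Counts (x1000), 0 to 800]

Sorted alphabetically: much less useful.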

1 United States $1582 China $6443 Japan $544 Germany $2445 Britain $2356 France $1937 Brazil $1428 Italy $1319 Australia $12810 India $119

1 United States $13792 Japan $2343 Germany $204 Britain $1685 France $1266 Canada $737 Italy $638 China $54 9 Netherlands $5410 Australia $48

Recent Annual Software Sales ($billions)Recent Annual Computer Hardware Sales ($billion)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of student debt by state, 2009-10 and 2012-13; vertical axis: 0 to 40,000; states shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National Average: 2009-10 $21,604; 2012-13 $25,043

Student Debt: North Carolina Schools

[Figure: bar chart "North Carolina Private Schools"; series: Tuition and fees (in-state), Average debt of graduates; horizontal axis: 0 to 60,000; schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Figure: bar chart "North Carolina Public Schools"; series: Tuition and fees (in-state), Average debt of graduates; horizontal axis: 0 to 25,000; schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Unnecessary dimension in a pie chart

3rd dimension is unnecessary the 3D pie chart does not convey any more information than a 2D pie chart

Section 3.1 (continued) Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION"; classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two histograms with different centers; classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Histograms - Same Center, Different Spread

[Figure: two histograms with the same center and different spread; classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx 700)"; horizontal axis: TIMES, 19.2 to 20.16; vertical axis: Frequency, 0 to 60; marked outliers: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida marked]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis: Bin; vertical axis: Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
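The frequency and relative frequency tables can be rebuilt from the 30 exam grades with a short Python sketch. The helper name `frequency_table` is hypothetical, and the code assumes every value lies in [start, stop), with each class including its lower limit and excluding its upper limit ("40 up to 50").

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]


def frequency_table(data, start, stop, width):
    """Count values per class [lo, lo + width); assumes start <= value < stop."""
    counts = {lo: 0 for lo in range(start, stop, width)}
    for value in data:
        lo = start + ((value - start) // width) * width  # lower limit of the class
        counts[lo] += 1
    return counts


freq = frequency_table(grades, 40, 100, 10)
rel_freq = {lo: count / len(grades) for lo, count in freq.items()}

for lo in freq:
    print(f"{lo} up to {lo + 10}: {freq[lo]:2d}  {rel_freq[lo]:.3f}")
```

The relative frequencies are just the counts divided by n = 30, so they sum to 1 (up to rounding).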

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1 50

2 5

3 17

4 30

Stem and leaf displays have the following general appearance:

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4
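A stem-and-leaf display like the one above can be generated with a few lines of Python. This is a sketch for two-digit values with the 10's digit as stem and the 1's digit as leaf; the function name `stem_and_leaf` is made up for illustration, and empty stems between the smallest and largest value are kept so that gaps stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines 'stem | leaves' with stem = 10's digit, leaf = 1's digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
for line in stem_and_leaf(ages):
    print(line)
```

Calling `stem_and_leaf(ages + [95])` adds empty rows for stems 7 and 8 and a row `9 | 5`, which makes the new value stand out as an outlier.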

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates n = 138

Count  Stem | Leaves
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10's digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{(population mean)}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram "Absences from Work"; horizontal axis classes: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)

Median = mean of 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
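The definition of the median translates directly into code. The sketch below is one way to write it in Python (the name `median_of` is chosen here only to avoid clashing with library functions); Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median_of(values):
    """Middle value of the sorted data (n odd) or mean of the 2 middle values (n even)."""
    xs = sorted(values)  # arrange in order of magnitude first
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2


print(median_of([2, 4, 6, 8, 10]))  # n = 5, middle value
print(median_of([2, 4, 6, 8]))      # n = 4, mean of 4 and 6
```

Sorting inside the function matters: applying the middle-position rule to unsorted data is a common mistake.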

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years; median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7

175 28 32 139 141 253 458

Example: n = 7 (ordered)

28 32 139 141 175 253 458

m = 141

Example: n = 8

175 28 32 139 141 253 357 458

Example: n = 8 (ordered)

28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245

2) 4965.5

3) 4960

4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245

2) 4965.5

3) 5546

4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th observation; m = 85

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum 1985-2014"; series: Mean, Median, Maximum; horizontal axis: Year, 1985 to 2013; left vertical axis: Mean, Median Salary, 200,000 to 3,700,000; right vertical axis: Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar labels as extracted: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approximately equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

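Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$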

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = √6.55 = 2.56 inches

Women height (inches)

  i    x_i    x̄     (x_i - x̄)   (x_i - x̄)²
  1    59    63.4     -4.4        19.0
  2    60    63.4     -3.4        11.3
  3    61    63.4     -2.4         5.6
  4    62    63.4     -1.4         1.8
  5    62    63.4     -1.4         1.8
  6    63    63.4     -0.4         0.1
  7    63    63.4     -0.4         0.1
  8    63    63.4     -0.4         0.1
  9    64    63.4      0.6         0.4
 10    64    63.4      0.6         0.4
 11    65    63.4      1.6         2.7
 12    66    63.4      2.6         7.0
 13    67    63.4      3.6        13.3
 14    68    63.4      4.6        21.6

Mean x̄ = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
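As one example of letting software do the work, the women's heights calculation can be reproduced in Python. The sketch below follows the two steps (variance first, then square root) and is only an illustration; the variable names are arbitrary, and `statistics.stdev` from the standard library returns the same sample standard deviation in one call.

```python
from math import sqrt
from statistics import stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean = sum(heights) / n
squared_devs = [(x - mean) ** 2 for x in heights]

# Step 1: sample variance, dividing by the degrees of freedom n - 1
variance = sum(squared_devs) / (n - 1)
# Step 2: the standard deviation is the square root of the variance
s = sqrt(variance)

print(f"mean = {mean:.1f}, sum of squares = {sum(squared_devs):.1f}")
print(f"variance = {variance:.2f}, s = {s:.2f}, library stdev = {stdev(heights):.2f}")
```

The table rounds each squared deviation to one decimal place, so the last digit of a hand total can differ slightly from the unrounded computation.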

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

$m$: population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Figure: diagram linking the mean and standard deviation (numerical) with the histogram (graphical) through the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$;

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$;

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$.

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$; vertical axis: 0 to 0.45]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$; vertical axis: 0 to 0.45]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

Percentage of data values in this interval: 32/50 = 64%

68-95-99.7 rule: 68%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

Percentage of data values in this interval: 48/50 = 96%

68-95-99.7 rule: 95%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

Percentage of data values in this interval: 50/50 = 100%

68-95-99.7 rule: 99.7%
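The three interval checks in the textbook-cost example can be reproduced with a few lines of code. The following is a minimal Python sketch using only the standard library; the helper name `count_within` is illustrative and does not come from the slides.

```python
import statistics

# Textbook costs from the example (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

ybar = statistics.mean(costs)   # sample mean
s = statistics.stdev(costs)     # sample standard deviation (divides by n - 1)


def count_within(data, center, spread, k):
    """Count the values strictly inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo < y < hi for y in data)


for k in (1, 2, 3):
    c = count_within(costs, ybar, s, k)
    print(f"within {k} sd: {c} of {len(costs)} ({100 * c / len(costs):.0f}%)")
```

For these data the counts are 32, 48, and 50, which correspond to the 64%, 96%, and 100% reported in the example.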

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

A z-score measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to $y$:

$$z = \frac{y - \bar{y}}{s}$$

where
  $y$ = original data value
  $\bar{y}$ = the sample mean
  $s$ = the sample standard deviation
  $z$ = the z-score corresponding to $y$

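Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$, your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$, your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.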

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
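The SAT/ACT comparison is a direct application of the z-score formula. A minimal Python sketch (the function name `z_score` is an illustrative choice, not from the slides):

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


eleanor = z_score(680, 500, 100)   # SAT Math
gerald = z_score(27, 18, 6)        # ACT Math

print(eleanor, gerald)
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because z-scores are unit-free, scores from tests with different scales can be compared on the same footing.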

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0, Sum of z-scores = 0
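The claim that z-scores add to zero can be checked numerically. This is a small Python sketch using the support figures from the table as printed; since those figures are rounded to one decimal, an individual z-score may differ from the table in the last digit.

```python
import statistics

# Support in $ millions, as printed in the table (rounded values).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
ybar = statistics.mean(values)
s = statistics.stdev(values)

# Standardize every school's support figure.
z = {school: (y - ybar) / s for school, y in support.items()}

# The deviations from the mean cancel, so the z-scores sum to 0
# (up to floating-point rounding).
print(abs(sum(z.values())) < 1e-9)
```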

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

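m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (observation number, position counted within each half, value):

   1  1  0.6
   2  2  1.2
   3  3  1.6
   4  4  1.9
   5  5  1.5
   6  6  2.1
   7  7  2.3
   8  6  2.3
   9  5  2.5
  10  4  2.8
  11  3  2.9
  12  2  3.3
  13  1  3.4
  14  2  3.6
  15  3  3.7
  16  4  3.8
  17  5  3.9
  18  6  4.1
  19  7  4.2
  20  6  4.5
  21  5  4.7
  22  4  4.9
  23  3  5.3
  24  2  5.6
  25  1  6.1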

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

   1/4   |   1/4   |   1/4   |   1/4
         Q1        M         Q3

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
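The quartile rule above translates directly into code. The sketch below is one possible Python rendering of that rule (the function name `quartiles` is illustrative). Library routines such as `statistics.quantiles` use other interpolation conventions and may return slightly different quartiles for the same data.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ys[: half + 1], ys[half:]
    else:
        # n even: the overall median belongs to neither half
        lower, upper = ys[:half], ys[half:]
    return (
        statistics.median(lower),
        statistics.median(ys),
        statistics.median(upper),
    )


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For the ten-value example this gives Q1 = 6, median = 11, and Q3 = 16.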

Pulse Rates, n = 138

Count  Stem  Leaves
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

[Boxplot of the Disease X data; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Boxplot: display of 5-number summary

[Boxplot of the Disease X data, with individual 24 at 6.1 and individual 25 at 7.9; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot labels: 45, 40.5, 63, 70, 78, 100.5]
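The 1.5 IQR outlier rule used for the pulse data and for the Disease X data can be sketched in Python as follows. The function names `fences` and `is_outlier` are illustrative and are not taken from the slides.

```python
def fences(q1, q3):
    """Return (lower, upper) outlier fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(y, q1, q3):
    """True when y falls outside the fences."""
    lower, upper = fences(q1, q3)
    return y < lower or y > upper


print(fences(63, 78))              # class pulse rates: (40.5, 100.5)
print(is_outlier(7.9, 2.3, 4.2))   # Disease X, individual 25: True
```

Values flagged this way are plotted individually in a boxplot, and each whisker stops at the most extreme data value that lies inside the fences.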

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew  First  Second  Third  Total
Alive     212    202     118    178    710
Dead      673    123     167    528   1491
Total     885    325     285    706   2201

Marginal distributions

Marginal distribution of survival:
  Alive:  710/2201 = 32.3%
  Dead:  1491/2201 = 67.7%

Marginal distribution of class:
  Crew:    885/2201 = 40.2%
  First:   325/2201 = 14.8%
  Second:  285/2201 = 12.9%
  Third:   706/2201 = 32.1%

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                           Class
Survival              Crew   First  Second  Third  Total
Alive   Count          212     202     118    178    710
        % of col      24.0    62.2    41.4   25.2   32.3
Dead    Count          673     123     167    528   1491
        % of col      76.0    37.8    58.6   74.8   67.7
Total   Count          885     325     285    706   2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic survival-by-class table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
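The marginal distributions, the conditional distributions, and the three fractions from the Titanic table all come from the same counts divided by different totals. A minimal Python sketch, with variable names chosen for illustration:

```python
# Counts from the Titanic contingency table.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())
col_totals = {c: sum(counts[r][c] for r in counts) for c in classes}

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {r: sum(counts[r].values()) / total for r in counts}
marginal_class = {c: col_totals[c] / total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

first_alive = counts["Alive"]["First"]
print(first_alive / sum(counts["Alive"].values()))  # survivors who were in first class
print(first_alive / total)                          # first class AND survivor
print(first_alive / col_totals["First"])            # first class passengers who survived
```

The numerator is the same in all three fractions; only the denominator changes, which is the difference between a conditional and a joint proportion.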

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student  Beers  Blood Alcohol
   1       5      0.10
   2       2      0.03
   3       9      0.19
   4       7      0.095
   5       3      0.07
   6       3      0.02
   7       4      0.07
   8       5      0.085
   9       8      0.12
  10       3      0.04
  11       5      0.06
  12       5      0.05
  13       6      0.10
  14       7      0.09
  15       1      0.01
  16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) how many beers they drank, and
2) their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

For bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

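Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$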
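The correlation formula can be applied to the fuel consumption data in a few lines. This Python sketch assumes the car weights and fuel consumptions are the ten pairs from the scatterplot slide with their decimal points restored (for example 3.4 and 5.5); the function name `correlation` is illustrative.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Average product of the z-scores of x and y, dividing by n - 1."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))
```

With these values the result agrees with the r = .9766 shown on the slide. Each term of the sum is a product of two z-scores, which is why r has no units and is unchanged by a change of measurement scale.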

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


1 United States $158
2 China $644
3 Japan $54
4 Germany $244
5 Britain $235
6 France $193
7 Brazil $142
8 Italy $131
9 Australia $128
10 India $119

1 United States $1379
2 Japan $234
3 Germany $20
4 Britain $168
5 France $126
6 Canada $73
7 Italy $63
8 China $54
9 Netherlands $54
10 Australia $48

Recent Annual Software Sales ($ billions)

Recent Annual Computer Hardware Sales ($ billions)

NY Times

Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Pie charts: percent of deaths from top 10 causes; percent of deaths from all causes]

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Bar chart: average debt by state for 2009-10 and 2012-13; vertical axis 0 to 40000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California]

National average: 2009-10 $21,604; 2012-13 $25,043

Student Debt: North Carolina Schools

[Bar chart: North Carolina Private Schools; tuition and fees (in-state) and average debt of graduates; axis 0 to 60000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina (4-year or above), Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University]

[Bar chart: North Carolina Public Schools; tuition and fees (in-state) and average debt of graduates; axis 0 to 25000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina (4-year or above), NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City]

Unnecessary dimension in a pie chart

3rd dimension is unnecessary the 3D pie chart does not convey any more information than a 2D pie chart

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Histogram: BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; vertical axis: frequency, 0 to 70; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18]

Relative Frequency Histogram of Exam Grades

[Histogram: horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Two histograms on the same classes (0<2, 2<4, ..., 16<18), frequency axis 0 to 70, with different centers]

Histograms - Same Center, Different Spread

[Two histograms on the same classes (0<2, 2<4, ..., 16<18), frequency axis 0 to 70, with the same center but different spread]

Histograms: Shape

Symmetric distribution: a distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: a distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Histogram: 200 m Races 20.2 secs or less (approx. 700); horizontal axis: TIMES, 19.2 to 20.16 in steps of 0.06; vertical axis: Frequency, 0 to 60. Labeled outliers: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

[Histogram with Alaska and Florida labeled as outliers]

Excel Example: 2012-13 NFL Salaries

[Excel histogram of salaries; horizontal axis: Bin; vertical axis: Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50         2
50 up to 60         6
60 up to 70         8
70 up to 80         7
80 up to 90         5
90 up to 100        2
Total              30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
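The frequency and relative frequency tables can be built directly from the 30 exam grades. A minimal Python sketch, assuming classes of width 10 that include the lower limit and exclude the upper limit ("40 up to 50" means 40 <= grade < 50):

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

lower_limits = range(40, 100, 10)

# Frequency: how many grades fall in each class.
freq = {lo: sum(lo <= g < lo + 10 for g in grades) for lo in lower_limits}

# Relative frequency: class frequency divided by the number of grades.
rel_freq = {lo: count / len(grades) for lo, count in freq.items()}

for lo in lower_limits:
    print(f"{lo} up to {lo + 10}: {freq[lo]:2d}  {rel_freq[lo]:.3f}")
```

The relative frequencies sum to 1, so a relative frequency histogram has the same shape as the frequency histogram with only the vertical scale changed.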

Relative Frequency Histogram of Grades

[Histogram: horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50
2. 5
3. 17
4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

Count  Stem  Leaves
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000   stem   2012-13
          2 |  4  | 03
          6 |  3  | 7
          2 |  3  | 24
       6655 |  2  | 6677789
43322221100 |  2  | 01222233444
 9998887666 |  1  | 67889
        421 |  1  | 134
            |  0  | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Histogram: Absences from Work; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
       = mean of 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
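The odd/even rule for the median can be written out explicitly. The function below is a sketch for illustration; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value if n is odd; mean of the two middle values if n is even."""
    ys = sorted(values)
    n = len(ys)
    mid = n // 2
    if n % 2 == 1:
        return ys[mid]
    return (ys[mid - 1] + ys[mid]) / 2


print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```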

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011, $166,500; May 2010, $174,600

Median household income (2008 dollars): 2009, $50,221; 2008, $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$
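The class pulse rates also show how differently the mean and median react to one extreme value. In the Python sketch below, dropping the largest pulse (140) is an illustration added here, not a step taken on the slides.

```python
import statistics

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

mean_all = statistics.mean(pulses)
median_all = statistics.median(pulses)

# Hypothetical comparison: the same data without the largest value (140).
trimmed = pulses[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(round(mean_all, 2), median_all)          # 84.48 85
print(round(mean_trimmed, 2), median_trimmed)  # 81.95 84.5
```

Removing a single observation moves the mean by about 2.5 beats but the median by only half a beat, because the mean uses the value of every observation.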

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Chart: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year, 1985 to 2013; left axis: Mean, Median Salary, 200000 to 3700000; right axis: Maximum Salary, 0 to 35000000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am". Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$ is the deviation of $y_i$ from the mean $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y})$ sums the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

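Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$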

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women's heights (inches)

  i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
  1   59    63.4      -4.4            19.0
  2   60    63.4      -3.4            11.3
  3   61    63.4      -2.4             5.6
  4   62    63.4      -1.4             1.8
  5   62    63.4      -1.4             1.8
  6   63    63.4      -0.4             0.1
  7   63    63.4      -0.4             0.1
  8   63    63.4      -0.4             0.1
  9   64    63.4       0.6             0.4
 10   64    63.4       0.6             0.4
 11   65    63.4       1.6             2.7
 12   66    63.4       2.6             7.0
 13   67    63.4       3.6            13.3
 14   68    63.4       4.6            21.6

Mean: 63.4     Sum of deviations: 0.0     Sum of squared deviations: 85.2


1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
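As one possible software route, here is a minimal sketch using Python's standard `statistics` module on the 14 heights from the example. The slides do not prescribe a tool; Python and the variable names here are assumptions made for illustration.

```python
from statistics import mean, stdev, variance

# Heights (inches) of the 14 women in the example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

xbar = mean(heights)
s2 = variance(heights)  # sample variance: divides by n - 1
s = stdev(heights)      # sample standard deviation: square root of s2

# The same variance computed directly from the definition.
sum_sq_dev = sum((x - xbar) ** 2 for x in heights)
s2_by_hand = sum_sq_dev / (len(heights) - 1)

print(round(xbar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by N instead and correspond to the population versions.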

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0 (when does $s = 0$? $\sigma = 0$?)

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$     sample mean
$m$     sample median
$s^2$     sample variance
$s$     sample stand. dev.

POPULATION
$\mu$     population mean
$m$     population median
$\sigma^2$     population variance
$\sigma$     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) and Histogram (graphical), combined in the 68-95-99.7 rule.]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

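68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$.]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$.]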

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: $\frac{50}{50} = 100\%$; 68-95-99.7 rule: 99.7%
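The three interval counts can be checked with a short script. This is a sketch using Python's standard library, not something from the slides; the helper name `share_within` is made up for this illustration.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

ybar = mean(costs)
s = stdev(costs)  # sample standard deviation (divides by n - 1)


def share_within(k):
    """Count and percentage of costs inside (ybar - k*s, ybar + k*s)."""
    low, high = ybar - k * s, ybar + k * s
    count = sum(low < c < high for c in costs)
    return count, 100 * count / len(costs)


for k in (1, 2, 3):
    print(k, share_within(k))
```

The observed percentages are close to, but not exactly, 68, 95 and 99.7, which is what the rule promises for data that are only approximately bell-shaped.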

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
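The comparisons above all apply one formula, so a small helper makes the pattern explicit. This is an illustrative sketch; the function name `z_score` and its parameter names are not from the slides.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# SAT vs ACT: scores on different scales become comparable after standardizing.
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

# The two exams with the same mean but different spreads.
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

print(eleanor, gerald)
print(exam1, exam2)
```

The larger z-score marks the better relative standing, even when the raw score is smaller, as with 91 on exam 1 versus 92 on exam 2.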

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of z-scores = 0
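The zero sums in the table are not special to this data set. A short derivation, added here as a supplement and using only the definitions $z_i = (y_i - \bar{y})/s$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (and assuming $s > 0$), shows why:

```latex
\begin{align*}
\sum_{i=1}^{n} z_i
  &= \sum_{i=1}^{n} \frac{y_i - \bar{y}}{s}
   = \frac{1}{s}\left(\sum_{i=1}^{n} y_i - n\bar{y}\right) \\
  &= \frac{1}{s}\left(n\bar{y} - n\bar{y}\right)
   = 0
\end{align*}
```

The middle step is the same fact noted earlier for deviations: the deviations from the mean always sum to zero, and dividing each one by $s$ does not change that.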

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1     M     Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
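The rules above can be written out as a short function. This sketch follows the slide's convention (overall median included in both halves when n is odd); other textbooks and software packages use different quartile conventions and may give slightly different answers. The function names are chosen for this illustration.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rules stated above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]  # n odd: overall median goes in both halves
    else:
        lower = ordered[:half]      # n even: overall median is in neither half
    upper = ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For odd n, `ordered[half:]` starts at the overall median, so the same slice serves as the upper half in both cases.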

Pulse Rates, n = 138

  #   Stem | Leaves
         4 |
  3      4 | 588
  9      5 | 001233444
 10      5 | 5556788899
 23      6 | 00011111122233333344444
 23      6 | 55556666667777788888888
 16      7 | 00000112222334444
 23      7 | 55555666666777888888999
 10      8 | 0000112224
 10      8 | 5555667789
  4      9 | 0012
  2      9 | 58
  4     10 | 0223
        10 |
  1     11 | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
  (4)   28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
  (4)   28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for "Disease X". Vertical axis: Years until death (0 to 7).]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for "Disease X". Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
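The outlier check can be sketched in Python as follows. The quartiles are taken from the slide (Q1 = 2.3, Q3 = 4.2) rather than recomputed, the data list is the 25 values from the table with years until death read in decimal form, and the variable names are illustrative.

```python
# Years until death for the 25 individuals (largest value is 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2          # quartiles as given on the slide
iqr = q3 - q1              # interquartile range, 1.9
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest observation inside the fence.
whisker_top = max(y for y in years if y <= upper_fence)

print(round(upper_fence, 2), outliers, whisker_top)
```

The lower fence is computed the same way; here it falls below zero, so no small value can be flagged.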

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with marked values 40.5, 45, 63, 70, 78, 100.5.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers". Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew   First   Second   Third   Total
Alive      212     202      118     178     710
Dead       673     123      167     528    1491
Total      885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%
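The marginal percentages can be reproduced from the cell counts alone. The following is a minimal sketch in plain Python; the nested-dictionary layout and the variable names are choices made for this illustration.

```python
# Cell counts from the Titanic table (rows: survival, columns: class).
table = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of survival: row totals over the grand total.
survival_marginal = {
    status: round(100 * sum(row.values()) / grand_total, 1)
    for status, row in table.items()
}

# Marginal distribution of class: column totals over the grand total.
class_marginal = {
    c: round(100 * sum(table[status][c] for status in table) / grand_total, 1)
    for c in classes
}

print(survival_marginal)
print(class_marginal)
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.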

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                              Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
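All three answers share the numerator 202 and differ only in the denominator, which is the group being conditioned on. A short sketch (variable names are illustrative, not from the slides):

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())                 # all survivors
first_total = alive["First"] + dead["First"]      # all first class
grand_total = total_alive + sum(dead.values())    # everyone on board

# Of the survivors, what fraction were in first class?
survivors_in_first = alive["First"] / total_alive

# Of everyone, what fraction were in first class AND survived?
first_and_survived = alive["First"] / grand_total

# Of the first class, what fraction survived?
first_who_survived = alive["First"] / first_total

for fraction in (survivors_in_first, first_and_survived, first_who_survived):
    print(round(100 * fraction, 1))
```

The last value matches the 62.2 in the "% of col" row for First, since that row is the conditional distribution of survival given class.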

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

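Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$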
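The formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide, with weight in 1000 lbs and fuel consumption in gal/100 miles. This is a sketch using Python's standard `statistics` module; the function name `correlation` and the list names are chosen for this illustration.

```python
from statistics import mean, stdev

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # y, gal/100 miles


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight is recorded in kilograms or fuel in liters.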

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30).]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Percent of people dying from top 10 causes of death in the United States

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

Percent of deaths from top 10 causes

Percent of deaths from all causes

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state, 2009-10 and 2012-13. Vertical axis: $0 to $40,000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California.]

National Average: 2009-10 $21,604; 2012-13 $25,043

Student Debt: North Carolina Schools

[Figure: bar chart "North Carolina Private Schools". Series: Tuition and fees (in-state); Average debt of graduates. Horizontal axis: 0 to 60,000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University.]

[Figure: bar chart "North Carolina Public Schools". Series: Tuition and fees (in-state); Average debt of graduates. Horizontal axis: 0 to 25,000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City.]

Unnecessary dimension in a pie chart

3rd dimension is unnecessary the 3D pie chart does not convey any more information than a 2D pie chart

Section 3.1 (continued) Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION". Classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: frequency, 0 to 70.]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30).]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms with different centers. Classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70.]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms with the same center and different spread. Classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70.]

Histograms Shape

A distribution is symmetric if the right and left

sides of the histogram are approximately mirror

images of each other

Symmetric distribution

Complex multimodal distribution

Not all distributions have a simple overall shape

especially when there are few observations

Skewed distribution

A distribution is skewed to the right if the right

side of the histogram (side with larger values)

extends much farther out than the left side It is

skewed to the left if the left side of the histogram

extends much farther out than the right side

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont.): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx. 700)". Horizontal axis: TIMES (19.2 to 20.16); vertical axis: Frequency (0 to 60). Labeled points: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]

Shape (cont.): Outliers

[Figure: histogram with Alaska and Florida labeled.]

An important kind of deviation is an outlier Outliers are observations

that lie outside the overall pattern of a distribution Always look for

outliers and try to explain them

The overall pattern is fairly

symmetrical except for 2

states clearly not belonging

to the main trend Alaska

and Florida have unusual

representation of the

elderly in their population

A large gap in the

distribution is typically a

sign of an outlier

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries. Horizontal axis: Bin; vertical axis: Frequency (0 to 1000).]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
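The frequency and relative frequency columns can be generated from the 30 raw grades. Here is a minimal sketch in plain Python, treating each class as "lower limit up to, but not including, upper limit"; the variable names are illustrative.

```python
# The 30 exam grades from the example.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

edges = [40, 50, 60, 70, 80, 90, 100]  # class limits

# Frequency: number of grades with low <= grade < high in each class.
counts = [sum(low <= g < high for g in grades)
          for low, high in zip(edges, edges[1:])]

# Relative frequency: class count divided by the total number of grades.
relative = [round(c / len(grades), 3) for c in counts]

for (low, high), c, rf in zip(zip(edges, edges[1:]), counts, relative):
    print(f"{low} up to {high}: {c}  {rf:.3f}")
```

The relative frequencies are what a relative frequency histogram plots as bar heights; because of rounding they may sum to slightly more or less than 1.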

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30).]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%

2. 5%

3. 17%

4. 30%

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

  #   Stem | Leaves
         4 |
  3      4 | 588
  9      5 | 001233444
 10      5 | 5556788899
 23      6 | 00011111122233333344444
 23      6 | 55556666667777788888888
 16      7 | 00000112222334444
 23      7 | 55555666666777888888999
 10      8 | 0000112224
 10      8 | 5555667789
  4      9 | 0012
  2      9 | 58
  4     10 | 0223
        10 |
  1     11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

[Figure: Unemployment Rate by Educational Attainment]

[Figure: Water Use During Super Bowl XLV (Packers 31, Steelers 25)]

[Figure: Heat Maps]

[Figure: Word Wall (customer feedback)]

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: frequency histogram, Absences from Work. Horizontal axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70).]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
Median = mean of the 2 middle values (n even)

Ex: 2 4 6 8 10; n = 5; median = 6
Ex: 2 4 6 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
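As a check on the definition, here is a minimal Python sketch of the odd/even rule applied to the 62 student pulse rates. The helper name `median` is our own choice (Python's standard `statistics.median` follows the same rule).

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Student pulse rates from the slide (n = 62).
pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(len(pulses), median(pulses))  # 62 75.5
```

With n = 62 (even), the two middle values sit in ordered positions 31 and 32, which are 75 and 76.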

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458
Example n = 7 (ordered): 28 32 139 141 175 253 458; m = 141

Example n = 8: 175 28 32 139 141 253 357 458
Example n = 8 (ordered): 28 32 139 141 175 253 357 458; m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2 4 6 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2 4 6 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation; $m = 85$

2010, 2014 baseball salaries

          2010           2014
n         845            848
mean      $3,297,828     $3,932,912
median    $1,330,000     $1,456,250
max       $33,000,000    $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
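To see this sensitivity in numbers, the sketch below recomputes the mean and median of the class pulse rates with and without the single largest value (140). Dropping that observation is purely an illustration of influence, not a recommendation to delete data; the variable names are our own.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the one unusually large observation left out.
without_140 = [p for p in pulses if p != 140]

print(round(mean(pulses), 2), median(pulses))            # 84.48 85
print(round(mean(without_140), 2), median(without_140))  # 81.95 84.5
```

One observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.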

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year; left vertical axis: Mean, Median Salary ($200,000 to $3,700,000); right vertical axis: Maximum Salary ($0 to $35,000,000).]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries. Horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600). Bar heights, left to right: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1.]

Skewed to the left (negatively skewed): mean < median

mean = 78, median = 87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30).]

Symmetric data: mean and median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude, and sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ is the deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ sums the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

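Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$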

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women's height (inches)

 i    x_i   mean    x_i - mean   (x_i - mean)^2
 1    59    63.4    -4.4         19.0
 2    60    63.4    -3.4         11.3
 3    61    63.4    -2.4          5.6
 4    62    63.4    -1.4          1.8
 5    62    63.4    -1.4          1.8
 6    63    63.4    -0.4          0.1
 7    63    63.4    -0.4          0.1
 8    63    63.4    -0.4          0.1
 9    64    63.4     0.6          0.4
10    64    63.4     0.6          0.4
11    65    63.4     1.6          2.7
12    66    63.4     2.6          7.0
13    67    63.4     3.6         13.3
14    68    63.4     4.6         21.6

Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
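As one example of "other software", the women's height calculation can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not the course's required tool; `stdev` divides by n - 1, matching the sample formula above.

```python
from statistics import mean, stdev

# Women's heights (inches) from the table, n = 14.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations
s2 = sum_sq_dev / (len(heights) - 1)                  # sample variance, df = n - 1
s = s2 ** 0.5                                         # sample standard deviation

print(round(x_bar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))
# 63.4 85.2 6.55 2.56

# The library routine gives the same standard deviation.
print(round(stdev(heights), 2))  # 2.56
```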

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$    sample median
$s^2$    sample variance
$s$    sample stand. dev.

POPULATION
$\mu$    population mean
population median
$\sigma^2$    population variance
$\sigma$    population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

The 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical).

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

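68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$ (34% on each side of $\bar{y}$).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$ (47.5% on each side of $\bar{y}$).]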

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
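The three interval counts can be checked with a short script. This is a sketch using Python's `statistics` module; the helper `count_within` is a name invented here, and values exactly on an interval endpoint are counted as inside.

```python
from statistics import mean, stdev

# Textbook costs from the example (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar, s = mean(costs), stdev(costs)

def count_within(data, center, spread, k):
    """Number of values in [center - k*spread, center + k*spread]."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo <= v <= hi for v in data)

for k in (1, 2, 3):
    inside = count_within(costs, y_bar, s, k)
    print(f"within {k} sd: {inside} of {len(costs)} = {inside / len(costs):.0%}")
```

The observed 64%, 96%, and 100% are close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.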

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
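A short sketch confirms that the deviations, and therefore the z-scores, sum to zero. The support figures are the table values with decimal points assumed to fall as in the stated mean (9.1) and standard deviation (3.5697); the dictionary layout is our own.

```python
from statistics import mean, stdev

# Support in $ millions for the 9 public ACC schools (2013), from the table.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar, s = mean(values), stdev(values)

# z = (y - y_bar) / s for every school.
z_scores = {school: (y - y_bar) / s for school, y in support.items()}

print(round(y_bar, 4), round(s, 4))       # 9.1 3.5697
print(abs(sum(z_scores.values())) < 1e-9)  # True: z-scores add to zero
```

Because each z-score is a deviation divided by the same constant s, the z-scores must add to zero whenever the deviations do; individually rounded values may be off by a hundredth.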

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

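m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Disease X, years until death (columns: observation number, running count within each quarter, value)

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1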

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1    M    Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
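The rule above can be written as a small Python sketch. The function names are our own, and this "median of halves" convention is only one of several in use: calculators and software packages may interpolate differently and return slightly different quartiles.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    n odd:  the overall median is included in both halves.
    n even: the data split cleanly into a lower and an upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                      # size of each half under this rule
    lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

For odd n the slice sizes overlap by one element, which is exactly the shared overall median.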

Pulse Rates n = 138

Count | Stem | Leaves
   | 4 |
3 | 4 | 588
9 | 5 | 001233444
10 | 5 | 5556788899
23 | 6 | 00011111122233333344444
23 | 6 | 55556666667777788888888
16 | 7 | 00000112222334444
23 | 7 | 55555666666777888888999
10 | 8 | 0000112224
10 | 8 | 5555667789
4 | 9 | 0012
2 | 9 | 58
4 | 10 | 0223
   | 10 |
1 | 11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data: minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Disease X data, listed from observation 25 down to observation 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot of Disease X. Vertical axis: Years until death (0 to 7).]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Disease X data, listed from observation 25 down to observation 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot of Disease X (display of 5-number summary). Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with marked values 40.5, 45, 63, 70, 78, 100.5.]
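The 1.5 × IQR calculation can be sketched as a small function. The name `outlier_fences` and the default multiplier argument are our own choices for illustration; the five-number summary is the one given earlier for the pulse data.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Beginning-of-class pulse rates: five-number summary from the slides.
minimum, q1, med, q3, maximum = 45, 63, 70, 78, 111

lower, upper = outlier_fences(q1, q3)
print(lower, upper)       # 40.5 100.5
print(minimum < lower)    # False: 45 is not a low outlier
print(maximum > upper)    # True: 111 lies above the upper fence
```

Values beyond a fence are plotted individually, and each whisker stops at the most extreme data value that is still inside the fences.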

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers. Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178     710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival: Alive 710/2201 = 32.3%; Dead 1491/2201 = 67.7%

marg. dist. of class: Crew 885/2201 = 40.2%; First 325/2201 = 14.8%; Second 285/2201 = 12.9%; Third 706/2201 = 32.1%
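The marginal distributions come from the row and column totals divided by the grand total. A minimal sketch, with the table stored as a nested dictionary (a layout chosen here for illustration):

```python
# Titanic counts: survival status by class, from the contingency table.
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in titanic.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / total for status, row in titanic.items()
}

# Marginal distribution of class: column totals / grand total.
class_marginal = {
    cls: sum(row[cls] for row in titanic.values()) / total
    for cls in titanic["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```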

[Figure: Marginal distribution of class, bar chart]

[Figure: Marginal distribution of class, pie chart]

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second  Third   Total
Alive   Count        212     202     118     178     710
        % of col.    24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count        673     123     167     528     1491
        % of col.    76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count        885     325     285     706     2201

[Figure: Conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
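All three answers share the numerator 202 (first-class survivors); only the denominator, meaning the group being conditioned on, changes. A brief sketch with variable names of our own choosing:

```python
# Titanic counts by class for each survival status.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

first_and_alive = alive["First"]                # 202
survivors = sum(alive.values())                 # 710
everyone = survivors + sum(dead.values())       # 2201
first_class = alive["First"] + dead["First"]    # 325

share_of_survivors_in_first = first_and_alive / survivors    # 202/710
share_first_and_alive = first_and_alive / everyone           # 202/2201
share_of_first_who_survived = first_and_alive / first_class  # 202/325

print(f"{share_of_survivors_in_first:.3f}")   # 0.285
print(f"{share_first_and_alive:.3f}")         # 0.092
print(f"{share_of_first_who_survived:.3f}")   # 0.622
```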

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students: 1) how many beers they drank, and 2) their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

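Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$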
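The formula can be evaluated directly as the average product of standardized values. The sketch below uses the ten (weight, fuel consumption) pairs from the scatterplot, with decimal points assumed to fall so that the values match the axis ranges (1.5 to 4.5 and 2 to 7); the function name is our own.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the paired z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)

print(round(correlation(weight, fuel), 4))  # 0.9766
```

Because both variables are standardized first, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.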

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

Percent of deaths from top 10 causes

Percent of deaths from all causes

Make sure your labels match the data.

Make sure all percents add up to 100.

Internships

Basic bar chart Side-by-side bar chart

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state, 2009-10 vs 2012-13; vertical axis 0 to 40,000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California.]

National average: 2009-10 $21,604; 2012-13 $25,043

[Figure: North Carolina Private Schools, bar chart of tuition and fees (in-state) and average debt of graduates; horizontal axis 0 to 60,000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Mean North Carolina - 4-year or above, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University.]

[Figure: North Carolina Public Schools, bar chart of tuition and fees (in-state) and average debt of graduates; horizontal axis 0 to 25,000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A & T, Mean North Carolina - 4-year or above, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City.]

Student Debt: North Carolina Schools

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued) Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION. Classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: frequency (0 to 70).]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency.]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms over the classes 0<2 through 16<18 (vertical axis 0 to 70), with different centers.]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms over the classes 0<2 through 16<18 (vertical axis 0 to 70), with the same center but different spread.]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

[Figure: two histograms. Age is left-skewed; cost is right-skewed.]

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: frequency histogram, "200 m Races 20.2 secs or less (approx 700)"; horizontal axis TIMES, 19.2 to 20.16; frequency axis 0 to 60. Labeled points: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram by state, with Alaska and Florida marked.]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis Bin; vertical axis Frequency, 0 to 1000.]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
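The frequency and relative frequency distributions for the exam grades can be reproduced with a few lines of code. This is a minimal sketch in Python; the function name `frequency_table` and its parameters are illustrative assumptions, not something the slides prescribe.

```python
from collections import Counter

# The 30 exam grades listed in the example.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(data, start=40, stop=100, width=10):
    """Hypothetical helper: count values in classes 'lo up to hi' (lo <= x < hi)."""
    counts = Counter((x - start) // width for x in data if start <= x < stop)
    n = len(data)
    rows = []
    for k in range((stop - start) // width):
        lo = start + k * width
        freq = counts.get(k, 0)
        rows.append((lo, lo + width, freq, freq / n))  # (lower, upper, count, relative)
    return rows

table = frequency_table(grades)
for lo, hi, freq, rel in table:
    print(f"{lo} up to {hi}: frequency {freq}, relative frequency {rel:.3f}")
```

Each class includes its lower limit and excludes its upper limit, which is one reasonable reading of "40 up to 50".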

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis Grade, 40 to 100; vertical axis Relative frequency, 0 to .30.]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50%

2. 5%

3. 17%

4. 30%

Stem and leaf displays

Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
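The construction of the employee-age display can be sketched in code. This is a minimal Python sketch, assuming two-digit (or larger) whole-number data with the 10's digit as the stem; the helper name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return (stem, sorted leaves) pairs, keeping empty stems in between."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)  # stem = 10's digit, leaf = 1's digit
    stems = range(min(leaves), max(leaves) + 1)
    return [(stem, leaves.get(stem, [])) for stem in stems]

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

for stem, leaf_list in stem_and_leaf(ages):
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaf_list)}")
```

Calling `stem_and_leaf(ages + [95])` keeps the empty stems 7 and 8, so a gap in the data stays visible in the display.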

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$, using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
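The same calculation can be checked in software. A minimal Python sketch follows; the variable names are illustrative only.

```python
# Weekly TV viewing time (hours) of the 7 fourth graders in the example.
tv_hours = [19, 40, 16, 12, 10, 6, 9]

total = sum(tv_hours)          # sum of the observations
y_bar = total / len(tv_hours)  # sample mean = sum / n

print(f"sum = {total}, n = {len(tv_hours)}, sample mean = {y_bar}")
```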

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} = \text{population mean}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: frequency histogram, "Absences from Work"; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; frequency axis 0 to 70.]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
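The median rule (middle value when n is odd, mean of the 2 middle values when n is even) can be sketched as a short function. This is an illustrative Python sketch applied to the 62 student pulse rates; the function name `median` is an assumption of the sketch.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Student pulse rates (n = 62) from the slide.
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(len(pulses), median(pulses))
```

With n = 62 the two middle values sit in ordered positions 31 and 32, which are 75 and 76.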

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

[Figure: histogram with mean 55.26 years and median 57.7 years marked.]

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$n = 23, \qquad \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation, so $m = 85$

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: time plot, "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis Year, 1985 to 2013; left vertical axis Mean, Median Salary, 200,000 to 3,700,000; right vertical axis Maximum Salary, 0 to 35,000,000; series Mean, Median, Maximum.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: frequency histogram, "2011 Baseball Salaries"; horizontal axis Salary ($1000s); frequency axis 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1.]

Skewed to the left (negatively skewed): mean < median

mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis Exam Scores, 20 to 100; frequency axis 0 to 30.]

Symmetric data: mean and median approximately equal

[Figure: frequency histogram, "Bank Customers 10:00-11:00 am"; horizontal axis Number of Customers; frequency axis 0 to 20.]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude, and sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \ \text{always; tells us nothing}$$

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \qquad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \qquad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

  i    x_i    x̄      (x_i - x̄)   (x_i - x̄)²
  1    59    63.4     -4.4        19.0
  2    60    63.4     -3.4        11.3
  3    61    63.4     -2.4         5.6
  4    62    63.4     -1.4         1.8
  5    62    63.4     -1.4         1.8
  6    63    63.4     -0.4         0.1
  7    63    63.4     -0.4         0.1
  8    63    63.4     -0.4         0.1
  9    64    63.4      0.6         0.4
 10    64    63.4      0.6         0.4
 11    65    63.4      1.6         2.7
 12    66    63.4      2.6         7.0
 13    67    63.4      3.6        13.3
 14    68    63.4      4.6        21.6

Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
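As one software option, the two steps (variance first, then its square root) can be sketched in Python for the 14 women's heights. This is an illustrative sketch; the variable names are assumptions, and any statistics package would serve equally well.

```python
import math

# Heights (inches) of the 14 women in the worked table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n
deviations = [x - mean for x in heights]        # x_i - x_bar
sum_sq = sum(d ** 2 for d in deviations)        # sum of squared deviations

variance = sum_sq / (n - 1)                     # step 1: s^2, with n - 1 df
sd = math.sqrt(variance)                        # step 2: s

print(f"mean = {mean:.1f}, sum of squares = {sum_sq:.1f}")
print(f"variance = {variance:.2f} square inches, sd = {sd:.2f} inches")
```

The deviations sum to zero (up to floating-point rounding), which is why they are squared before averaging.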

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} = \text{population standard deviation}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU); $\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) combined with Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

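68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between ȳ - s and ȳ + s, 34% on each side of ȳ; vertical axis 0 to 0.45.]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ; vertical axis 0 to 0.45.]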

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
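The three interval counts for the textbook costs can be verified with a short script. This is a minimal Python sketch that takes the mean and standard deviation as stated on the slide (375.48 and 42.72) rather than recomputing them; the helper name `count_within` is hypothetical.

```python
# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72  # values stated on the slide (assumed, not recomputed)

def count_within(data, center, spread, k):
    """Count data values strictly inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(1 for x in data if lo < x < hi)

counts = [count_within(costs, y_bar, s, k) for k in (1, 2, 3)]
for k, c in zip((1, 2, 3), counts):
    print(f"within {k} sd: {c}/{len(costs)} = {100 * c / len(costs):.0f}%")
```

The observed percentages land close to 68%, 95%, and 99.7%, as the rule suggests for roughly bell-shaped data.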

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
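The z-score comparisons can be sketched as a one-line function. This is an illustrative Python sketch; the function name `z_score` is an assumption.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

eleanor = z_score(680, mean=500, sd=100)  # SAT Math
gerald = z_score(27, mean=18, sd=6)       # ACT Math
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

print(f"Eleanor z = {eleanor}, Gerald z = {gerald}")
print(f"exam 1 z = {exam1}, exam 2 z = {exam2}")
```

Because each score is expressed in units of its own standard deviation, scores from different scales become directly comparable.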

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6,185, with a standard deviation of $1,804. In NC the mean tuition was $4,320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (each entry: observation number, position count, value):

 1 1 0.6 |  2 2 1.2 |  3 3 1.6 |  4 4 1.9 |  5 5 1.5
 6 6 2.1 |  7 7 2.3 |  8 6 2.3 |  9 5 2.5 | 10 4 2.8
11 3 2.9 | 12 2 3.3 | 13 1 3.4 | 14 2 3.6 | 15 3 3.7
16 4 3.8 | 17 5 3.9 | 18 6 4.1 | 19 7 4.2 | 20 6 4.5
21 5 4.7 | 22 4 4.9 | 23 3 5.3 | 24 2 5.6 | 25 1 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: the sorted data split at Q1, M, and Q3 into 4 pieces, each holding 1/4 of the data.]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
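The quartile rules above can be sketched as code. This is a minimal Python sketch of this deck's convention (include the overall median in both halves when n is odd); other software may use different quartile conventions and give slightly different answers. The function names are assumptions of the sketch.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the halves split cleanly
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For an odd-n check, the 25 values introduced at the start of this section give Q1 = 2.3, m = 3.4, Q3 = 4.2 under the same rule.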

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
 (4)    28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
   2    22 | 55
   4    23 | 57
   6    24 | 26
   7    25 | 7
  10    26 | 257
  12    27 | 59
 (4)    28 | 1567
  15    29 | 35599
  10    30 | 333
   7    31 | 45
   5    32 | 155
   2    33 | 6
   1    34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Data (each entry: observation number, position count, value):

25 1 6.1 | 24 2 5.6 | 23 3 5.3 | 22 4 4.9 | 21 5 4.7
20 6 4.5 | 19 7 4.2 | 18 6 4.1 | 17 5 3.9 | 16 4 3.8
15 3 3.7 | 14 2 3.6 | 13 1 3.4 | 12 2 3.3 | 11 3 2.9
10 4 2.8 |  9 5 2.5 |  8 6 2.3 |  7 7 2.3 |  6 6 2.1
 5 5 1.5 |  4 4 1.9 |  3 3 1.6 |  2 2 1.2 |  1 1 0.6

[Figure: boxplot for Disease X; vertical axis Years until death, 0 to 7; the box and whiskers mark the five-number summary: min, Q1, m, Q3, max.]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (each entry: observation number, position count, value):

25 1 7.9 | 24 2 6.1 | 23 3 5.3 | 22 4 4.9 | 21 5 4.7
20 6 4.5 | 19 7 4.2 | 18 6 4.1 | 17 5 3.9 | 16 4 3.8
15 3 3.7 | 14 2 3.6 | 13 1 3.4 | 12 2 3.3 | 11 3 2.9
10 4 2.8 |  9 5 2.5 |  8 6 2.3 |  7 7 2.3 |  6 6 2.1
 5 5 1.5 |  4 4 1.9 |  3 3 1.6 |  2 2 1.2 |  1 1 0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis Years until death, 0 to 8; the largest value is plotted separately as an outlier.]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
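The outlier check for the Disease X data can be sketched in code. This is a minimal Python sketch, assuming the deck's quartile rule (overall median included in both halves when n is odd) and the 1.5 × IQR fences; the helper names are hypothetical.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 by the slide's rule: include the median in both halves if n is odd."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return median(ordered[:half + 1]), median(ordered[half:])
    return median(ordered[:half]), median(ordered[half:])

# Years until death for the 25 individuals (largest value 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = quartiles(years)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [y for y in years if y < lower_fence or y > upper_fence]
whisker_top = max(y for y in years if y <= upper_fence)  # where the upper line ends

print(f"IQR = {iqr:.1f}, upper fence = {upper_fence:.2f}")
print(f"outliers = {outliers}, upper whisker ends at {whisker_top}")
```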

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with marks at 40.5, 45, 63, 70, 78, and 100.5.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis marks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
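The marginal distributions come from the row and column totals divided by the grand total. The following is a minimal Python sketch using the Titanic counts; the dictionary layout and variable names are illustrative assumptions.

```python
# Titanic counts: survival (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {status: sum(row.values()) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total.
classes = list(counts["Alive"])
class_marginal = {c: sum(row[c] for row in counts.values()) / grand_total
                  for c in classes}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```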

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions

Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the table of counts and column percentages above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
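The three questions differ only in the denominator: a row total, the grand total, or a column total. This minimal Python sketch works through them and also rebuilds the "% of col" survival row; variable names are illustrative assumptions.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())
grand_total = total_alive + sum(dead.values())

# Conditional distribution of survival given class (column percentages).
survival_given_class = {c: alive[c] / (alive[c] + dead[c]) for c in alive}

# The three questions about first class:
first_among_survivors = alive["First"] / total_alive                    # 202/710
first_and_survived = alive["First"] / grand_total                       # 202/2201
survived_among_first = alive["First"] / (alive["First"] + dead["First"])  # 202/325

for c, share in survival_given_class.items():
    print(f"P(alive | {c}) = {100 * share:.1f}%")
print(f"{first_among_survivors:.3f} {first_and_survived:.3f} {survived_among_first:.3f}")
```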

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students.]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

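Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$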
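The formula for r can be applied directly to the 10 (car weight, fuel consumption) pairs. This is a minimal Python sketch of that formula; the function name `correlation` and the variable names are assumptions, and in practice a calculator or statistics package would do the same work.

```python
import math

# (car weight in 1000 lbs, fuel consumption in gal/100 miles) for 10 cars.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(weights, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if either variable is rescaled.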

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis Fouls, 0 to 30; vertical axis Points, 0 to 80.]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Internships

Basic bar chart Side-by-side bar chart

Trend Student Debt by State (grads of public 4 yr or more)

[Figure: chart of average student debt by state, 2009-10 vs. 2012-13; value axis from 0 to 40000. States shown: New Hampshire, Delaware, Minnesota, South Carolina, Alabama, Illinois, Montana, New Jersey, Indiana, West Virginia, Wisconsin, Idaho, Kansas, Arkansas, Kentucky, Oregon, Nebraska, Colorado, North Carolina, Wyoming, Washington, Florida, New York, Oklahoma, California. National average: 2009-10 $21,604; 2012-13 $25,043.]

Student Debt: North Carolina Schools

[Figure: North Carolina Private Schools. Bar chart of tuition and fees (in-state) and average debt of graduates; value axis from 0 to 60000. Schools shown: Campbell University Inc, New Life Theological Seminary, Meredith College, Mid-Atlantic Christian University, Wake Forest University, Methodist University, Johnson C Smith University, Chowan University, Catawba College, Mars Hill College, Elon University, Wingate University, Lenoir-Rhyne University, Davidson College, St Andrews Presbyterian College, Duke University, Belmont Abbey College, Brevard College, Warren Wilson College, Mount Olive College, Salem College, Saint Augustines College, High Point University, and the mean for North Carolina (4-year or above).]

[Figure: North Carolina Public Schools. Bar chart of tuition and fees (in-state) and average debt of graduates; value axis from 0 to 25000. Schools shown: UNC Greensboro, UNC School of the Arts, NC A&T, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, Elizabeth City, and the mean for North Carolina (4-year or above).]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued): Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram, BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; frequency axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis Grade (40 to 100); vertical axis Relative frequency (0 to .30)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with the same center but different spread]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Symmetric distribution

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont): Female heart attack patients in New York state

[Figures: histogram of Age (left-skewed); histogram of Cost (right-skewed)]

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: frequency histogram, 200 m Races 20.2 secs or less (approx 700); horizontal axis TIMES from 19.2 to 20.16; frequency axis 0 to 60. Labeled values: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis Bin; vertical axis Frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
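The two tables above can be reproduced with a few lines of code. This is a minimal sketch, not part of the original slides; the function name `frequency_table` is hypothetical, and each class is assumed to include its lower limit but not its upper limit ("40 up to 50" means 40 <= grade < 50).

```python
# Exam grades from the example (n = 30).
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]


def frequency_table(data, start, stop, width):
    """Count the values falling in each class [lower, lower + width)."""
    lowers = list(range(start, stop, width))
    counts = [sum(lo <= x < lo + width for x in data) for lo in lowers]
    return lowers, counts


lowers, counts = frequency_table(grades, 40, 100, 10)

# Relative frequency = class count divided by the number of observations.
rel_freqs = [c / len(grades) for c in counts]

for lo, c, rf in zip(lowers, counts, rel_freqs):
    print(f"{lo} up to {lo + 10}: frequency {c:2d}, relative frequency {rf:.3f}")
```

The relative frequencies always add to 1, which is a quick check that no observation was missed or counted twice.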

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis Grade (40 to 100); vertical axis Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50

2. 5

3. 17

4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5
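The construction of the display above can be sketched in code. This is an illustrative sketch only (the function name `stem_and_leaf` is hypothetical); it assumes two-digit whole-number data, with the 10's digit as the stem and the 1's digit as the leaf, and it prints empty stems so that gaps such as the one before the 95-year-old stay visible.

```python
def stem_and_leaf(values):
    """Return display lines 'stem | leaves' (stem = 10's digit, leaf = 1's digit)."""
    stems = {}
    for v in sorted(values):               # sorting puts leaves in ascending order
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    # Walk every stem from smallest to largest, including stems with no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

display = stem_and_leaf(ages)
display_with_95 = stem_and_leaf(ages + [95])   # the 95 yr old is hired

print("\n".join(display_with_95))
```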

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem | Leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9.7

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9.7 = 112.7$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112.7}{7} = 16.1$$
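The same arithmetic can be checked in code. This is a small sketch, assuming the seventh viewing time is 9.7 hours (the decimal point is restored from the garbled transcript so that the total and the mean agree); the variable names are illustrative.

```python
# Weekly TV viewing time (hours) of 7 randomly selected 4th graders.
# Assumption: the last value is 9.7, which makes the total 112.7.
tv_hours = [19, 40, 16, 12, 10, 6, 9.7]

n = len(tv_hours)        # sample size
total = sum(tv_hours)    # sum of the observations y_1 + ... + y_n
y_bar = total / n        # sample mean

print(f"n = {n}, sum = {total:.1f}, mean = {y_bar:.1f}")
```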

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: frequency histogram, Absences from Work; class boundaries 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; frequency axis 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)

Median = mean of 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
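The odd/even rule above translates directly into code. The function below is a minimal sketch written for illustration (Python's standard library also provides `statistics.median`, which follows the same rule).

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6, 8, 10]))   # n = 5 (odd)
print(median([2, 4, 6, 8]))       # n = 4 (even)
```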

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location: 12th obs; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
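A quick way to see this sensitivity is to recompute both measures with and without one extreme value. The sketch below reuses the class pulse rates from the earlier example and drops the largest value (140); the variable names are illustrative.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23); 140 is unusually large.
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single extreme observation removed.
without_140 = [p for p in pulses if p != 140]

print(f"with 140:    mean = {mean(pulses):.2f}, median = {median(pulses)}")
print(f"without 140: mean = {mean(without_140):.2f}, median = {median(without_140)}")
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only half a beat.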

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year (1985 to 2013). One vertical axis: Mean, Median Salary (200000 to 3700000); other vertical axis: Maximum Salary (0 to 35000000). Series: Mean, Median, Maximum.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: frequency histogram, 2011 Baseball Salaries; horizontal axis Salary ($1000s); frequency axis 0 to 600; bar labels 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores (20 to 100); frequency axis 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis Number of Customers; frequency axis 0 to 20]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

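Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$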
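The example can be verified numerically. This sketch (the helper name `deviations` is hypothetical) shows that the two data sets have very different spreads yet the same sum of deviations, which is why the deviations are squared before they are combined.

```python
def deviations(data):
    """Deviation of each observation from the mean of the data."""
    m = sum(data) / len(data)
    return [x - m for x in data]


set1 = [49, 51]    # tightly clustered around 50
set2 = [0, 100]    # widely spread around 50

for data in (set1, set2):
    devs = deviations(data)
    squared = [d ** 2 for d in devs]
    print(data, "deviations:", devs, "sum:", sum(devs),
          "sum of squares:", sum(squared))
```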

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

  i    x_i    x̄      (x_i - x̄)   (x_i - x̄)²
  1    59     63.4     -4.4        19.0
  2    60     63.4     -3.4        11.3
  3    61     63.4     -2.4         5.6
  4    62     63.4     -1.4         1.8
  5    62     63.4     -1.4         1.8
  6    63     63.4     -0.4         0.1
  7    63     63.4     -0.4         0.1
  8    63     63.4     -0.4         0.1
  9    64     63.4      0.6         0.4
 10    64     63.4      0.6         0.4
 11    65     63.4      1.6         2.7
 12    66     63.4      2.6         7.0
 13    67     63.4      3.6        13.3
 14    68     63.4      4.6        21.6
       Mean 63.4       Sum 0.0     Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
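As one example of "other software", the sketch below repeats the women's-height calculation step by step and then compares it with `statistics.stdev` from Python's standard library, which also divides by n - 1. The variable names are illustrative.

```python
import math
from statistics import stdev

# Women's heights (inches) from the table above, n = 14.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                              # mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)   # sum of squared deviations
variance = sum_sq_dev / (n - 1)                       # step 1: sample variance s^2
s = math.sqrt(variance)                               # step 2: standard deviation s

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq_dev:.1f}")
print(f"variance = {variance:.2f} square inches, s = {s:.2f} inches")
print(f"statistics.stdev gives {stdev(heights):.2f} inches")
```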

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data (square $, square gallons).

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

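68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]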

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
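The three interval counts in the textbook-cost example can be reproduced as follows. This is a sketch for checking the arithmetic; the helper name `share_within` is hypothetical, and the mean and standard deviation are recomputed from the 50 listed values rather than typed in.

```python
from statistics import mean, stdev

# Textbook costs from the example (n = 50).
costs = [286, 291, 307, 308, 315, 316, 327, 328,
         340, 342, 346, 347, 348, 348, 349, 354,
         355, 355, 360, 361, 364, 367, 369, 371,
         373, 377, 380, 381, 382, 385, 385, 387,
         390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437,
         440, 480]

y_bar = mean(costs)
s = stdev(costs)          # sample standard deviation (divides by n - 1)


def share_within(k):
    """Fraction of the data inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(low < c < high for c in costs) / len(costs)


print(f"mean = {y_bar:.2f}, s = {s:.2f}")
for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"within {k} s.d.: {share_within(k):.0%} (rule says about {rule})")
```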

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

[Figure: dotplot of men's weights]

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680

SAT mean = 500, sd = 100

ACT Math: Gerald's score 27

ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better
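Both comparisons (the two exams and the SAT/ACT scores) use the same one-line formula. The sketch below wraps it in a hypothetical helper, `z_score`, and applies it to the numbers from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Two exams with the same mean but different spreads.
z_exam1 = z_score(91, mean=88, sd=6)
z_exam2 = z_score(92, mean=88, sd=10)

# Scores on different scales: SAT Math vs ACT Math.
z_eleanor = z_score(680, mean=500, sd=100)
z_gerald = z_score(27, mean=18, sd=6)

print(f"exam 1: z = {z_exam1}, exam 2: z = {z_exam2}")
print(f"Eleanor: z = {z_eleanor}, Gerald: z = {z_gerald}")
```

The larger z-score marks the better performance relative to its own group, even when the raw score is lower or measured on a different scale.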

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47
                       Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697
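The "sum = 0" property can be checked directly. This sketch uses the support figures as they appear in the table (in $ millions, rounded to one decimal place, with the decimal points restored from the transcript), so individual z-scores may differ from the slide in the last digit.

```python
from statistics import mean, stdev

# Support in $ millions, as listed in the table (rounded values).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)

# Standardize every school's value.
z = {school: (y - y_bar) / s for school, y in support.items()}

for school, score in z.items():
    print(f"{school:<11}{score:6.2f}")
print(f"sum of z-scores = {sum(z.values()):.10f}")
```

The z-scores add to zero because the deviations from the mean add to zero, and dividing every deviation by the same s does not change that.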

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

The 25 observations, listed by position 1 through 25:

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
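The rule above can be sketched in code. Software packages use several slightly different quartile conventions, so this sketch implements the slide's rule by hand rather than relying on a library default; the function names `median_of_sorted` and `quartiles` are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}, IQR = {q3 - q1}")
```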

Pulse Rates, n = 138

Count  Stem | Leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Example: Disease X, years until death (n = 25)

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

[Figure: the 25 observations plotted for Disease X; vertical axis Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary

Disease X data, listed by position 1 through 25:

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Q1 = first quartile = 2.3; Q3 = third quartile = 4.2; Largest = max = 7.9

[Figure: BOXPLOT for Disease X; vertical axis Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
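The 1.5 × IQR calculation can be sketched as follows. The helper name `fences` is hypothetical, the Disease X values are read from the slide with decimal points restored (an assumption about the garbled transcript), and the quartiles are taken from the slides rather than recomputed.

```python
def fences(q1, q3):
    """Outlier cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Disease X, years until death (n = 25); the largest value is 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = fences(q1=2.3, q3=4.2)
outliers = [y for y in years if y < low or y > high]

# The upper whisker stops at the largest observation that is not an outlier.
upper_whisker = max(y for y in years if y <= high)

print(f"upper fence = {high:.2f}, outliers = {outliers}, whisker ends at {upper_whisker}")

# Same calculation for the beginning-of-class pulse rates (Q1 = 63, Q3 = 78).
pulse_low, pulse_high = fences(q1=63, q3=78)
print(f"pulse fences: {pulse_low} and {pulse_high}")
```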

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

Alive   710/2201    32.3%
Dead    1491/2201   67.7%

marg. dist. of class:

Crew     885/2201   40.2%
First    325/2201   14.8%
Second   285/2201   12.9%
Third    706/2201   32.1%

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
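The marginal and conditional distributions for the Titanic table can be computed from the cell counts alone. The sketch below is illustrative; the dictionary layout and variable names are choices made here, not part of the slides.

```python
# Titanic counts: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())

# Marginal distributions: row totals and column totals divided by the grand total.
marg_survival = {status: sum(row.values()) / total for status, row in counts.items()}
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
marg_class = {c: class_totals[c] / total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {
    c: {status: counts[status][c] / class_totals[c] for status in counts}
    for c in classes
}

# The three questions about first class use three different denominators.
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())
first_and_survived = counts["Alive"]["First"] / total
survived_given_first = survival_given_class["First"]["Alive"]

print(f"fraction of survivors in first class:     {first_among_survivors:.3f}")
print(f"fraction in first class and survived:     {first_and_survived:.3f}")
print(f"fraction of first class who survived:     {survived_given_first:.3f}")
```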

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: scatterplot of BAC vs. Beers for the same 16 students]

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

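Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$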
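The formula for r can be applied to the car-weight data listed with the scatterplot. This is a sketch, assuming weights in 1000 lbs and fuel consumption in gal/100 miles with decimal points restored from the transcript; the function name `correlation` is hypothetical.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)     # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is the product of two z-scores, so r does not change when either variable is converted to different units.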

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Figure: scatterplot of Points (0 to 80) vs. Fouls (0 to 30)]

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 13: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Trend: Student Debt by State (grads of public 4 yr or more)

[Figure: bar chart of average student debt by state, 2009-10 vs. 2012-13; vertical axis 0 to 40,000 dollars. National Average: 2009-10 $21,604; 2012-13 $25,043]

Student Debt: North Carolina Schools

[Figure: North Carolina Private Schools. Horizontal bar chart of tuition and fees (in-state) and average debt of graduates for each school, with the mean for North Carolina 4-year or above schools; axis 0 to 60,000]

[Figure: North Carolina Public Schools. Horizontal bar chart of tuition and fees (in-state) and average debt of graduates for UNC Greensboro, UNC School of the Arts, NC A&T, NCSU, UNC-Wilmington, UNC Charlotte, ECU, Appalachian, UNC Asheville, and Elizabeth City, with the mean for North Carolina 4-year or above schools; axis 0 to 25,000]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 (continued): Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION. Frequency histogram; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; frequency axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis Grade (40 to 100), vertical axis Relative frequency (0 to .30)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms on the same classes (0<2 through 16<18, frequency axis 0 to 70) with the same center but different spread]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram of 200 m Races 20.2 secs or less (approx 700); horizontal axis TIMES (19.2 to 20.16), vertical axis Frequency (0 to 60). Labeled outliers: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

[Figure: histogram with Alaska and Florida labeled as outliers]

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis Bin, vertical axis Frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
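The frequency and relative frequency distributions for the exam grades can also be tabulated with software. A minimal Python sketch, assuming classes of the form "lo up to hi" include lo and exclude hi (the function name `frequency_table` is ours, not from the slides):

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(values, start, stop, width):
    """Return (lo, hi, count, relative frequency) for each class lo <= v < hi."""
    table = []
    for lo in range(start, stop, width):
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)
        table.append((lo, hi, count, count / len(values)))
    return table

for lo, hi, count, rel in frequency_table(grades, 40, 100, 10):
    print(f"{lo} up to {hi}: {count:2d}  {rel:.3f}")
```

The counts should match the table above (2, 6, 8, 7, 5, 2), and the relative frequencies are each count divided by n = 30.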

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis Grade (40 to 100), vertical axis Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5
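A stem-and-leaf display like the ones above can be built in a few lines. This is only a sketch, assuming two-digit-style data where the stem is everything but the last digit; the function name `stem_and_leaf` is ours. Empty stems are kept so that gaps (such as stems 7 and 8 once the 95-yr-old is hired) stay visible.

```python
def stem_and_leaf(values):
    """Return display lines; stem = value // 10, leaf = last digit."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    # Walk every stem from smallest to largest so empty stems still appear.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages + [95])))
```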

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N = 34000$ for NCSU)

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Figure: histogram; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; horizontal axis Absences from Work, vertical axis Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2 4 6 8 10; n = 5; median = 6
Ex: 2 4 6 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
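The odd/even rule for the median translates directly into code. A small Python sketch (the function name `median_by_rule` is ours), checked against the two short examples and the student pulse rates:

```python
def median_by_rule(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

pulse_rates = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68,
               70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74,
               75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79,
               80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93,
               94, 94, 95, 96, 96, 96, 98, 98, 103]

print(median_by_rule([2, 4, 6, 8, 10]))  # n odd
print(median_by_rule([2, 4, 6, 8]))      # n even
print(median_by_rule(pulse_rates))       # n = 62, mean of 31st and 32nd values
```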

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2 4 6 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2 4 6 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th obs., m = 85
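The class pulse rates are a convenient place to see how differently the mean and median react to one extreme value. The sketch below uses Python's standard `statistics` module; dropping the largest observation (140) is our own illustration, not a step from the slides.

```python
from statistics import mean, median

class_pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
                89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# All 23 observations, as in the example above.
print(round(mean(class_pulses), 2), median(class_pulses))

# Same data without the largest value: the mean moves by about 2.5,
# the median by only 0.5.
without_largest = class_pulses[:-1]
print(round(mean(without_largest), 2), median(without_largest))
```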

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014. Horizontal axis Year (1985 to 2013); left axis Mean, Median Salary (200,000 to 3,700,000); right axis Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s), vertical axis Frequency (0 to 600); bar frequencies 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores (20 to 100), vertical axis Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis Number of Customers, vertical axis Frequency (0 to 20)]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

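Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$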

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \qquad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \qquad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    x_i    x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59     63.4       -4.4            19.0
 2    60     63.4       -3.4            11.3
 3    61     63.4       -2.4             5.6
 4    62     63.4       -1.4             1.8
 5    62     63.4       -1.4             1.8
 6    63     63.4       -0.4             0.1
 7    63     63.4       -0.4             0.1
 8    63     63.4       -0.4             0.1
 9    64     63.4        0.6             0.4
10    64     63.4        0.6             0.4
11    65     63.4        1.6             2.7
12    66     63.4        2.6             7.0
13    67     63.4        3.6            13.3
14    68     63.4        4.6            21.6
      Mean 63.4        Sum 0.0         Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

1. First calculate the variance s^2.
2. Then take the square root to get the standard deviation s.

[Figure: data plotted with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
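As one possible software route, the women's height calculation can be reproduced step by step in Python. This is a sketch that mirrors the two-step recipe (variance first, then square root); the variable names are ours.

```python
from math import sqrt

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                          # sample mean
deviations = [x - x_bar for x in heights]         # these always sum to 0
squared_devs = [d ** 2 for d in deviations]

variance = sum(squared_devs) / (n - 1)            # divide by df = n - 1
s = sqrt(variance)                                # standard deviation

print(round(x_bar, 1), round(sum(squared_devs), 1),
      round(variance, 2), round(s, 2))
```

Python's standard library offers the same results through `statistics.variance` and `statistics.stdev`, which also divide by n - 1.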

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \qquad \text{population standard deviation}$$

$N$ is the population size (for example, $N = 34000$ for NCSU)

$\mu$ is the population mean

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s^2 sample variance; σ^2 population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0.

When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$         sample median
$s^2$       sample variance
$s$         sample stand. dev.

POPULATION
$\mu$       population mean
            population median
$\sigma^2$  population variance
$\sigma$    population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) linked to Histogram (graphical) by the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s (34% on each side of y-bar)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s (47.5% on each side of y-bar)]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:
$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:
$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:
$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
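The three interval counts for the textbook costs can be checked with a short Python sketch. It takes the mean and standard deviation as quoted in the example rather than recomputing them; the helper name `share_within` is ours.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328,
         340, 342, 346, 347, 348, 348, 349, 354,
         355, 355, 360, 361, 364, 367, 369, 371,
         373, 377, 380, 381, 382, 385, 385, 387,
         390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437,
         440, 480]

y_bar, s = 375.48, 42.72  # values quoted in the example

def share_within(values, center, spread, k):
    """Fraction of values inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    inside = sum(1 for v in values if lo < v < hi)
    return inside / len(values)

for k in (1, 2, 3):
    print(k, f"{share_within(costs, y_bar, s, k):.0%}")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, and 99.7% that the rule gives for a bell-shaped histogram.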

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40


Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
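The exam and SAT/ACT comparisons above come down to one formula, z = (y - mean)/sd. A minimal Python sketch (the function name `z_score` is ours):

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam comparison: the same mean, different spreads.
print(z_score(91, 88, 6), z_score(92, 88, 10))

# SAT vs. ACT: different scales become comparable after standardizing.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)
print(round(eleanor, 1), round(gerald, 1))
```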

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5        6.4       1.79
UVA           13.1        4.0       1.12
Louisville    10.9        1.8       0.50
UNC            9.2        0.1       0.03
VaTech         7.9       -1.2      -0.34
FSU            7.9       -1.2      -0.34
GaTech         7.1       -2.0      -0.56
NCSU           6.5       -2.6      -0.73
Clemson        3.8       -5.3      -1.47
                        Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

m = median = 3.4
Q1 = first quartile = 2.3
Q3 = third quartile = 4.2

Sorted data (second column counts positions in from each end toward the median):

 i   count   value
 1     1      0.6
 2     2      1.2
 3     3      1.6
 4     4      1.9
 5     5      1.5
 6     6      2.1
 7     7      2.3
 8     6      2.3
 9     5      2.5
10     4      2.8
11     3      2.9
12     2      3.3
13     1      3.4
14     2      3.6
15     3      3.7
16     4      3.8
17     5      3.9
18     6      4.1
19     7      4.2
20     6      4.5
21     5      4.7
22     4      4.9
23     3      5.3
24     2      5.6
25     1      6.1

Quartiles and median divide data into 4 pieces

        Q1        M        Q3
  1/4       1/4       1/4       1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
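The quartile rules above can be sketched in Python as follows. Software packages use several different quartile conventions, so a library routine may not agree with this rule on every data set; the code below follows only the steps stated here, and the function names are ours.

```python
def middle(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return middle(lower), middle(ordered), middle(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```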

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Five-number summary: min, Q1, m, Q3, max

Smallest = min = 0.6
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Largest = max = 6.1

[Figure: Disease X, years until death (axis 0 to 7); the 25 sorted observations shown beside a boxplot of the five-number summary]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot display of 5-number summary

Q1 = first quartile = 2.3
Q3 = third quartile = 4.2
Largest = max = 7.9

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

[Figure: Disease X, years until death (axis 0 to 8); boxplot of the 25 observations, in which the two largest values are now 6.1 and 7.9]

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with labeled values 40.5, 45, 63, 70, 78, 100.5]
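The 1.5 × IQR cutoffs used to flag outliers in a boxplot can be computed with a small helper. This Python sketch reuses the quartiles quoted above for the pulse data and for the Disease X example; the function name `fences` is ours.

```python
def fences(q1, q3):
    """Return (lower, upper) outlier cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Beginning-of-class pulses: Q1 = 63, Q3 = 78
low, high = fences(63, 78)
print(low, high)

# Disease X: Q1 = 2.3, Q3 = 4.2; is the observation 7.9 beyond the upper cutoff?
_, upper = fences(2.3, 4.2)
print(round(upper, 2), 7.9 > upper)
```

Values outside the two cutoffs are plotted individually as outliers, and the whiskers stop at the most extreme data values that are still inside them.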

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%
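The marginal distributions are just row totals and column totals divided by the grand total. A minimal Python sketch using the Titanic counts (the dictionary layout and variable names are ours):

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total
class_marginal = {cls: sum(row[cls] for row in counts.values()) / total
                  for cls in counts["Alive"]}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```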

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                      Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (answered from the Titanic survival-by-class table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
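The three questions use the same count (202 first-class survivors) with three different denominators. A short Python sketch makes the distinction explicit, and also reproduces the "% of col" conditional distribution of survival given class; variable names are ours.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

n_alive = sum(alive.values())            # 710 survivors
n_total = n_alive + sum(dead.values())   # 2201 people on board

first_given_alive = alive["First"] / n_alive        # fraction of survivors in first class
first_and_alive = alive["First"] / n_total          # fraction in first class AND survivors
alive_given_first = alive["First"] / (alive["First"] + dead["First"])  # first class who survived

# Conditional distribution of survival given class ("% of col", Alive row)
survival_given_class = {cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive}

print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
print({cls: f"{p:.1%}" for cls, p in survival_given_class.items()})
```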

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%


Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption
(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs); y-axis: FUEL CONSUMP (gal/100 miles)]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs); y-axis: FUEL CONSUMP (gal/100 miles)]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
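The formula for r can be evaluated directly as the sum of products of standardized x and y values, divided by n - 1. The following Python sketch applies it to the fuel consumption data, assuming the ten (weight, fuel) pairs read as decimals (weights in 1000 lbs, fuel in gal/100 miles); the function name `correlation` is illustrative:

```python
from statistics import mean, stdev

# Fuel consumption example: x = car weight (1000 lbs), y = gal/100 miles.
x = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
y = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(x, y):
    """r = (1/(n-1)) * sum of products of standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)
    return sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)


r = correlation(x, y)
print(round(r, 4))
```

Because each value is standardized first, r has no units and does not change if weight is recorded in pounds or fuel in liters.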

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; x-axis: Fouls (0 to 30); y-axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

Page 14: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Student Debt: North Carolina Schools

[Figure: two bar charts, North Carolina Private Schools (scale 0 to 60000) and North Carolina Public Schools (scale 0 to 25000), each comparing Tuition and fees (in-state) with Average debt of graduates by school, with the mean for North Carolina 4-year-or-above schools included]

Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram, BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; frequency axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms over the classes 0<2 through 16<18 (frequency axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms over the classes 0<2 through 16<18 (frequency axis 0 to 70) with the same center but different spread]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: frequency histogram, 200 m Races 20.2 secs or less (approx 700); x-axis: TIMES (19.2 to 20.16); y-axis: Frequency (0 to 60); labeled points: Usain Bolt 2008 (19.30), Michael Johnson 1996 (19.32)]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

[Figure: histogram with Alaska and Florida labeled as outliers]

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of salaries; x-axis: Bin; y-axis: Frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
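The two tables can be reproduced by counting how many grades fall in each class and dividing by n = 30. A minimal Python sketch, assuming "40 up to 50" means the lower limit is included and the upper limit is excluded:

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

# Classes "40 up to 50", ..., "90 up to 100": lower limit included,
# upper limit excluded.
lower_limits = range(40, 100, 10)
frequency = {
    lo: sum(1 for g in grades if lo <= g < lo + 10) for lo in lower_limits
}
relative_frequency = {lo: f / len(grades) for lo, f in frequency.items()}

for lo in lower_limits:
    print(f"{lo} up to {lo + 10}: {frequency[lo]:2d}  {relative_frequency[lo]:.3f}")
```

The relative frequencies are the bar heights of the relative frequency histogram, and they sum to 1.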

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50

2. 5

3. 17

4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
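The construction above (stem = 10's digit, leaf = 1's digit, leaves in ascending order) can be sketched in Python as follows. The function name `stem_and_leaf` is illustrative, and the choice to print empty stems is an assumption made so that gaps in the data stay visible:

```python
from collections import defaultdict

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]


def stem_and_leaf(values):
    """Return display lines with stem = 10's digit and leaf = 1's digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Include every stem between the smallest and largest, even if empty,
    # so gaps in the data stay visible.
    stems = range(min(leaves), max(leaves) + 1)
    return [
        f"{s} | {' '.join(str(d) for d in leaves[s])}".rstrip() for s in stems
    ]


display = stem_and_leaf(ages)
print("\n".join(display))
```

Calling `stem_and_leaf(ages + [95])` adds empty rows for stems 7 and 8 before the row `9 | 5`, which makes the new value stand out as a possible outlier.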

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

Count   Stem | Leaves
           4 |
    3      4 | 588
    9      5 | 001233444
   10      5 | 5556788899
   23      6 | 00011111122233333344444
   23      6 | 55556666667777788888888
   16      7 | 00000112222334444
   23      7 | 55555666666777888888999
   10      8 | 0000112224
   10      8 | 5555667789
    4      9 | 0012
    2      9 | 58
    4     10 | 0223
          10 |
    1     11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages:

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages:

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
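The same calculation in Python, as a small sketch: sum the observations and divide by the sample size. The variable names are illustrative.

```python
# Weekly TV viewing time (hours) of the 7 fourth graders in the example.
hours = [19, 40, 16, 12, 10, 6, 9]

# Sample mean: add the observations and divide by the sample size n.
n = len(hours)
total = sum(hours)
y_bar = total / n

print(n, total, y_bar)
```

The standard library function `statistics.mean(hours)` returns the same value.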

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Figure: Excel histogram, Absences from Work; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; y-axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
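The median rule (middle value for odd n, mean of the two middle values for even n) can be written as a short Python function. This is a sketch for illustration; the function name `median` mirrors the slide's term and is checked against the three examples above:

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


pulse = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
         70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
         77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
         90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(median([2, 4, 6, 8, 10]), median([2, 4, 6, 8]), median(pulse))
```

With n = 62 the two middle values are the 31st and 32nd ordered observations, 75 and 76.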

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th obs; m = 85

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart, Baseball Salaries: Mean, Median and Maximum, 1985-2014; x-axis: Year; left axis: Mean, Median Salary (200000 to 3700000); right axis: Maximum Salary (0 to 35000000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: frequency histogram, 2011 Baseball Salaries; x-axis: Salary ($1000s); y-axis: Frequency (0 to 600)]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis: Exam Scores (20 to 100); y-axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: frequency histogram, Bank Customers 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest: ok sometimes; in general, too crude; sensitive to one large or small obs

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59    63.4        -4.4             19.0
 2    60    63.4        -3.4             11.3
 3    61    63.4        -2.4              5.6
 4    62    63.4        -1.4              1.8
 5    62    63.4        -1.4              1.8
 6    63    63.4        -0.4              0.1
 7    63    63.4        -0.4              0.1
 8    63    63.4        -0.4              0.1
 9    64    63.4         0.6              0.4
10    64    63.4         0.6              0.4
11    65    63.4         1.6              2.7
12    66    63.4         2.6              7.0
13    67    63.4         3.6             13.3
14    68    63.4         4.6             21.6
Mean  63.4          Sum  0.0        Sum  85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
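As one way to automate the calculation, the following Python sketch reproduces the table's steps for the 14 women's heights and then checks them against the standard library's `statistics.variance` and `statistics.stdev`, which both use the n - 1 divisor. Variable names are illustrative.

```python
from statistics import mean, stdev, variance

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)
s2 = sum_sq_dev / (len(heights) - 1)   # sample variance, df = n - 1
s = s2 ** 0.5                          # sample standard deviation

print(round(x_bar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))

# The standard library functions give the same results.
print(round(variance(heights), 2), round(stdev(heights), 2))
```

The deviations in the table are computed from the unrounded mean, which is why the squared deviation for 59 is shown as 19.0 rather than (-4.4)^2 = 19.36.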

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

The 68-95-99.7 rule connects the Mean and Standard Deviation (numerical) with the Histogram (graphical).

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

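68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-2s and y+2s, 47.5% on each side of the mean y]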

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
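The interval counts in the textbook example can be checked with a few lines of Python. This sketch takes the mean and standard deviation as the values reported on the slides (375.48 and 42.72) instead of recomputing them; the helper name `share_within` is illustrative:

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72   # mean and standard deviation reported on the slides


def share_within(k):
    """Fraction of observations strictly inside (y_bar - k*s, y_bar + k*s)."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(1 for c in costs if lo < c < hi) / len(costs)


for k in (1, 2, 3):
    print(k, share_within(k))
```

The observed shares are close to, but not exactly, 68%, 95%, and 99.7%; the rule is an approximation that works best when the histogram is roughly bell-shaped.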

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8
Gerald's z-score: z = (27-18)/6 = 1.5
Eleanor's score is better.
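Standardizing puts scores from different scales on a common footing. A minimal Python sketch of the z-score formula applied to the SAT/ACT comparison and the two-exam example; the function name `z_score` is illustrative:

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


eleanor = z_score(680, 500, 100)   # SAT Math
gerald = z_score(27, 18, 6)        # ACT Math
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

print(eleanor, gerald, exam1, exam2)
```

The larger z-score identifies the score that is further above its own mean relative to the spread of its own distribution.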

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3   <- Q1 = first quartile = 2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4   <- m = median = 3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2   <- Q3 = third quartile = 4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

     Q1    M    Q3
1/4    1/4   1/4   1/4

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
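The quartile rules above can be sketched in Python as follows. The function names are illustrative, and the rule implemented is the one stated on the slides (overall median included in both halves when n is odd, in neither half when n is even):

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule on the slides:
    n odd  -> include the overall median in both halves;
    n even -> include it in neither half."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2          # size of each half under that rule
    lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

Calculators and software packages use several slightly different quartile conventions, so their Q1 and Q3 can differ a little from the values this rule gives.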

Pulse Rates, n = 138

Count   Stem | Leaves
           4 |
    3      4 | 588
    9      5 | 001233444
   10      5 | 5556788899
   23      6 | 00011111122233333344444
   23      6 | 55556666667777788888888
   16      7 | 00000112222334444
   23      7 | 55555666666777888888999
   10      8 | 0000112224
   10      8 | 5555667789
    4      9 | 0012
    2      9 | 58
    4     10 | 0223
          10 |
    1     11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
   2      22 | 55
   4      23 | 57
   6      24 | 26
   7      25 | 7
  10      26 | 257
  12      27 | 59
  (4)     28 | 1567
  15      29 | 35599
  10      30 | 333
   7      31 | 45
   5      32 | 155
   2      33 | 6
   1      34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
   2      22 | 55
   4      23 | 57
   6      24 | 26
   7      25 | 7
  10      26 | 257
  12      27 | 59
  (4)     28 | 1567
  15      29 | 35599
  10      30 | 333
   7      31 | 45
   5      32 | 155
   2      33 | 6
   1      34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Disease X: years until death

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

[Figure: Disease X data listed from largest (6.1) to smallest (0.6) next to a boxplot; y-axis: Years until death (0 to 7)]

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Ordered data, three columns as on the slide: overall position, (position within its group), value
25 (1) 7.9   24 (2) 6.1   23 (3) 5.3   22 (4) 4.9   21 (5) 4.7
20 (6) 4.5   19 (7) 4.2   18 (6) 4.1   17 (5) 3.9   16 (4) 3.8
15 (3) 3.7   14 (2) 3.6   13 (1) 3.4   12 (2) 3.3   11 (3) 2.9
10 (4) 2.8    9 (5) 2.5    8 (6) 2.3    7 (7) 2.3    6 (6) 2.1
 5 (5) 1.5    4 (4) 1.9    3 (3) 1.6    2 (2) 1.2    1 (1) 0.6

Boxplot display of 5-number summary

[Figure: boxplot for Disease X; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
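
The outlier check on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `upper_whisker_and_outliers` is made up here, the quartiles are passed in as given on the slide rather than recomputed, and the data list is the Disease X column read with one decimal place.

```python
# Years until death for the 25 Disease X patients, as listed on the slide
# (largest value 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]


def upper_whisker_and_outliers(data, q1, q3):
    """Apply the 1.5 x IQR rule on the high side of a boxplot."""
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    # Values beyond the fence are flagged as outliers.
    outliers = sorted(v for v in data if v > upper_fence)
    # The whisker stops at the largest value that is still inside the fence.
    whisker_end = max(v for v in data if v <= upper_fence)
    return upper_fence, whisker_end, outliers


fence, whisker_end, outliers = upper_whisker_and_outliers(years, q1=2.3, q3=4.2)
print(round(fence, 2), whisker_end, outliers)
```

The same rule applies on the low side with Q1 - 1.5 IQR.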

ATM Withdrawals by Day Month Holidays

Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with marks at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive  710/2201 = 32.3%
Dead   1491/2201 = 67.7%

marg. dist. of class:
Crew    885/2201 = 40.2%
First   325/2201 = 14.8%
Second  285/2201 = 12.9%
Third   706/2201 = 32.1%
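
As a rough sketch, the marginal distributions can be computed from the cell counts by dividing row totals and column totals by the grand total. The variable names below are illustrative only; the counts are the Titanic counts from the slide.

```python
# Counts from the Titanic table: rows = survival, columns = class.
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: sum(row[j] for row in counts.values()) / grand_total
    for j, cls in enumerate(classes)
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```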

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions

Given the class of a passenger, what is the chance the passenger survived?

Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
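
The three questions differ only in the denominator. A small sketch, with illustrative variable names and the slide's counts, makes that explicit:

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())                 # all survivors
total_people = total_alive + sum(dead.values())   # everyone on board
first_class = alive["First"] + dead["First"]      # everyone in first class

# Conditional on surviving: share of survivors who were in first class.
first_given_alive = alive["First"] / total_alive
# Joint: share of everyone who was both in first class and a survivor.
first_and_alive = alive["First"] / total_people
# Conditional on class: share of first class passengers who survived.
alive_given_first = alive["First"] / first_class

print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
```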

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students in the Student / Beers / BAC table]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

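Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$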
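
The formula for r can be checked against the fuel consumption data. The sketch below is one direct transcription of the formula, not the only way to compute it; the function name `correlation` is illustrative, and the data pairs are the ten (weight, fuel consumption) points from the scatterplot slide read with one decimal place.

```python
import math

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
cars = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]


def correlation(pairs):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs)
    return sum(products) / (n - 1)


r = correlation(cars)
print(round(r, 4))
```

The result should agree with the r = .9766 reported on the slide, up to rounding.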

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3


Unnecessary dimension in a pie chart

The 3rd dimension is unnecessary: the 3D pie chart does not convey any more information than a 2D pie chart.

Section 3.1 continued: Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Figure: frequency histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION"; vertical axis 0 to 70; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis "Grade", 40 to 100; vertical axis "Relative frequency", 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms on the same classes (0<2, 2<4, ..., 16<18; vertical axis 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms on the same classes (0<2, 2<4, ..., 16<18; vertical axis 0 to 70) with the same center and different spread]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx 700)"; horizontal axis "TIMES", 19.2 to 20.16 in steps of .06; vertical axis "Frequency", 0 to 60; labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with two labeled outlying states, Alaska and Florida]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis "Bin", with class boundaries running from 369480 to 17547935.38; vertical axis "Frequency", 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
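
For readers who want to reproduce the two grade tables, here is a minimal sketch. It assumes each class contains its lower limit but not its upper limit ("40 up to 50" means 40 <= grade < 50); the function name `frequency_table` is illustrative.

```python
# The 30 exam grades from the example slide.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]


def frequency_table(data, start, stop, width):
    """Return (lower, upper, frequency, relative frequency) per class."""
    table = []
    for lower in range(start, stop, width):
        upper = lower + width
        count = sum(1 for value in data if lower <= value < upper)
        table.append((lower, upper, count, count / len(data)))
    return table


table = frequency_table(grades, start=40, stop=100, width=10)
for lower, upper, count, rel in table:
    print(f"{lower} up to {upper}: {count:2d}  {rel:.3f}")
```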

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis "Grade", 40 to 100; vertical axis "Relative frequency", 0 to .30]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
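
The construction can be sketched as follows for two-digit data. The function name `stem_and_leaf` is illustrative, and keeping empty stems in the output is a design choice made here so that gaps in the data stay visible.

```python
from collections import defaultdict

# Employee ages from the example slide.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]


def stem_and_leaf(values):
    """Return {stem: sorted leaves}; stem = 10's digit, leaf = 1's digit."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # Include stems with no leaves between the smallest and largest stem.
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}


display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Adding a value far from the rest, such as 95, produces empty stems 7 and 8 followed by `9 | 5`.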

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates n = 138

count  stem | leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order; time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram "Absences from Work"; horizontal axis classes 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis "Frequency", 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
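
The odd/even rule for the median translates directly into code. This is a small sketch with an illustrative function name; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Student pulse rates (n = 62) from the slide.
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(median([2, 4, 6, 8, 10]), median([2, 4, 6, 8]), median(pulses))
```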

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458
Example n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example n = 8: 175 28 32 139 141 253 357 458
Example n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th obs; $m = 85$

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis "Year", 1985 to 2013; left vertical axis "Mean, Median Salary", 200,000 to 3,700,000; right vertical axis "Maximum Salary", 0 to 35,000,000; series: Mean, Median, Maximum]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis "Salary ($1000s)"; vertical axis "Frequency", 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis "Exam Scores", 20 to 100; vertical axis "Frequency", 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis "Number of Customers"; vertical axis "Frequency", 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

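Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$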

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \qquad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \qquad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    $x_i$   $\bar{x}$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
 1     59     63.4        -4.4               19.0
 2     60     63.4        -3.4               11.3
 3     61     63.4        -2.4                5.6
 4     62     63.4        -1.4                1.8
 5     62     63.4        -1.4                1.8
 6     63     63.4        -0.4                0.1
 7     63     63.4        -0.4                0.1
 8     63     63.4        -0.4                0.1
 9     64     63.4         0.6                0.4
10     64     63.4         0.6                0.4
11     65     63.4         1.6                2.7
12     66     63.4         2.6                7.0
13     67     63.4         3.6               13.3
14     68     63.4         4.6               21.6
     Mean 63.4            Sum 0.0           Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

s = standard deviation = $\sqrt{6.55}$ = 2.56 inches

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: histogram of the heights with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
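
As one example of "other software", the heights calculation can be reproduced with a few lines of Python. This is a sketch that follows the slide's steps literally; the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [x - mean for x in heights]
sum_sq = sum(d ** 2 for d in deviations)

variance = sum_sq / (n - 1)      # step 1: sample variance s^2
std_dev = math.sqrt(variance)    # step 2: sample standard deviation s

print(round(mean, 1), round(sum_sq, 1), round(variance, 2), round(std_dev, 2))
```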

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \qquad \text{population standard deviation}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use $s$ to estimate the value of $\sigma$

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ = sample variance, $\sigma^2$ = population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION
$\mu$        population mean
$m$          population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) combined with Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

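68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve (vertical axis 0 to 0.45) with the central 68% shaded between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]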

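68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve (vertical axis 0 to 0.45) with the central 95% shaded between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]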

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$, same 50 data values as on the first textbook costs slide

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
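
The three interval counts can be verified with a short script. This sketch takes the mean and standard deviation as stated on the slide instead of recomputing them, and the helper name `count_within` is made up for illustration.

```python
# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72  # values given on the slide


def count_within(data, center, spread, k):
    """Count the values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for value in data if low <= value <= high)


for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    inside = count_within(costs, y_bar, s, k)
    print(f"within {k} sd: {inside}/{len(costs)} "
          f"= {100 * inside / len(costs):.0f}% (rule: {rule})")
```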

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
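
A z-score is a one-line computation, so comparisons like these are easy to script. The function name `z_score` is illustrative; the numbers are the exam and SAT/ACT values from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)
eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math

print(exam1, exam2, eleanor, gerald)
```

The larger z-score marks the better relative standing, even when the raw scores are on different scales.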

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47
                        Sum = 0   Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

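m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Ordered data, three columns as on the slide: overall position, (position within its group), value
 1 (1) 0.6    2 (2) 1.2    3 (3) 1.6    4 (4) 1.9    5 (5) 1.5
 6 (6) 2.1    7 (7) 2.3    8 (6) 2.3    9 (5) 2.5   10 (4) 2.8
11 (3) 2.9   12 (2) 3.3   13 (1) 3.4   14 (2) 3.6   15 (3) 3.7
16 (4) 3.8   17 (5) 3.9   18 (6) 4.1   19 (7) 4.2   20 (6) 4.5
21 (5) 4.7   22 (4) 4.9   23 (3) 5.3   24 (2) 5.6   25 (1) 6.1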

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1       M       Q3
 1/4  |  1/4   |  1/4  |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11
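The rule above can be sketched in Python as follows. The function name `quartiles` is an assumption for illustration. Software packages often use other quartile conventions, so their answers can differ slightly from this rule.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten values in the example this gives Q1 = 6, m = 11 and Q3 = 16, matching the hand calculation.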

Pulse Rates n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25 1 6.1; 24 2 5.6; 23 3 5.3; 22 4 4.9; 21 5 4.7
20 6 4.5; 19 7 4.2; 18 6 4.1; 17 5 3.9; 16 4 3.8
15 3 3.7; 14 2 3.6; 13 1 3.4; 12 2 3.3; 11 3 2.9
10 4 2.8; 9 5 2.5; 8 6 2.3; 7 7 2.3; 6 6 2.1
5 5 1.5; 4 4 1.9; 3 3 1.6; 2 2 1.2; 1 1 0.6

Largest = max = 6.1

Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Boxplot: Disease X, years until death (vertical axis 0 to 7)]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25 1 7.9; 24 2 6.1; 23 3 5.3; 22 4 4.9; 21 5 4.7
20 6 4.5; 19 7 4.2; 18 6 4.1; 17 5 3.9; 16 4 3.8
15 3 3.7; 14 2 3.6; 13 1 3.4; 12 2 3.3; 11 3 2.9
10 4 2.8; 9 5 2.5; 8 6 2.3; 7 7 2.3; 6 6 2.1
5 5 1.5; 4 4 1.9; 3 3 1.6; 2 2 1.2; 1 1 0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Boxplot: Disease X, years until death (vertical axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
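The 1.5 × IQR check above can be sketched in Python. The function name `outlier_fences` and the variable names are assumptions for illustration; Q1 = 2.3 and Q3 = 4.2 are taken from the slide rather than recomputed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Years until death for the 25 individuals (largest value 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

low, high = outlier_fences(q1=2.3, q3=4.2)
outliers = [y for y in years if y < low or y > high]
# The upper whisker stops at the largest value that is not an outlier
whisker_top = max(y for y in years if y <= high)
print(outliers, whisker_top)
```

Only 7.9 falls beyond the upper fence, so the whisker is drawn to 5.6 and 7.9 is plotted separately.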

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot of pulse rates, labeled 45, 63, 70, 78, 40.5 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
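The marginal and conditional percentages for the Titanic table can be reproduced with a short Python sketch. The dictionary layout and variable names are assumptions for illustration; the counts are those in the table.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]
grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distributions: row totals and column totals divided by the grand total
marginal_survival = {s: sum(row.values()) / grand_total for s, row in counts.items()}
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
marginal_class = {c: class_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

for c in classes:
    print(f"{c}: {100 * survival_given_class[c]:.1f}% survived")
```

The only difference between the marginal and the conditional calculation is the denominator: the grand total for a marginal distribution, the column total for a distribution conditional on class.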

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), vertical axis FUEL CONSUMP. (gal/100 miles)]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), vertical axis FUEL CONSUMP. (gal/100 miles)]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

Properties: r ranges from -1 to +1
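The formula for r can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. This is a minimal sketch; the function name `correlation` is an assumption for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, which is why r has no units and does not change if weight is recorded in kilograms instead of pounds.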

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Section 3.1 (continued) Displaying Quantitative Data

Histograms

Stem and Leaf Displays

Frequency Histograms

[Histogram: BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; frequency axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Histogram: Relative frequency (0 to .30) vs Grade (40 to 100)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Two histograms over the classes 0<2, 2<4, ..., 16<18; frequency axis 0 to 70]

Histograms - Same Center, Different Spread

[Two histograms over the classes 0<2, 2<4, ..., 16<18; frequency axis 0 to 70]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Symmetric distribution

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont): outliers. All 200 m Races 20.2 secs or less

[Histogram: 200 m Races 20.2 secs or less (approx 700); horizontal axis TIMES from 19.2 to 20.16, vertical axis Frequency 0 to 60; annotated Usain Bolt 2008, 19.30 and Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Histogram with Alaska and Florida marked]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Histogram drawn in Excel; horizontal axis Bin, vertical axis Frequency 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067

Relative Frequency Histogram of Grades

[Histogram: Relative frequency (0 to .30) vs Grade (40 to 100)]
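The frequency and relative frequency tables for the 30 grades can be rebuilt with a few lines of Python. This is a sketch; the names `grades` and `freq` are illustrative.

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

n = len(grades)
# Class "40 up to 50" contains grades g with 40 <= g < 50, and so on
freq = {lo: sum(lo <= g < lo + 10 for g in grades) for lo in range(40, 100, 10)}

for lo, count in freq.items():
    print(f"{lo} up to {lo + 10}: frequency {count}, relative frequency {count / n:.3f}")
```

Plotting the relative frequencies as bar heights over the class intervals gives the relative frequency histogram.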

Based on the histogram, about what percent of the values are between 475 and 525?

1 50

2 5

3 17

4 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5
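A stem-and-leaf display like the one above can be generated with a short Python sketch. The function name `stem_and_leaf` is an assumption for illustration, and it handles only two-digit whole numbers with the 10's digit as stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows 'stem | leaves', keeping empty stems so gaps stay visible."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)   # stem = 10's digit, leaf = 1's digit
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
rows = stem_and_leaf(ages + [95])   # add the 95 yr old
print("\n".join(rows))
```

Keeping the empty stems 7 and 8 in the output is what makes the 95 yr old stand out as separated from the rest of the ages.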

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots

plot observations in time order: time on horizontal axis, variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Histogram: Absences from Work; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis Frequency 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
       = mean of 2 middle values (n even)

Ex: 2 4 6 8 10; n = 5; median = 6
Ex: 2 4 6 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
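Both measures of center are available in Python's standard `statistics` module. The short sketch below reruns the TV-viewing mean and the median examples from the slides; the variable names are illustrative.

```python
from statistics import mean, median

tv_hours = [19, 40, 16, 12, 10, 6, 9]
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(mean(tv_hours))            # sum of the values divided by n
print(median([2, 4, 6, 8, 10]))  # n odd: the middle value
print(median([2, 4, 6, 8]))      # n even: mean of the 2 middle values
print(median(pulses))            # n = 62: mean of the 31st and 32nd ordered values
```

`median` sorts the data itself, so the values do not need to be entered in order.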

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th obs.; m = 85

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data.

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Time plot: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis Year; left axis Mean, Median Salary (200,000 to 3,700,000); right axis Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries; Frequency (0 to 600) vs Salary ($1000s); bar labels 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores; Frequency (0 to 30) vs Exam Scores (20 to 100)]

Symmetric data

mean, median approx. equal

[Histogram: Bank Customers 10:00-11:00 am; Frequency (0 to 20) vs Number of Customers]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

deviation of $y_i$ from the mean: $y_i - \bar{y}$

$\sum_{i=1}^{n}(y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1    59    63.4      -4.4            19.0
 2    60    63.4      -3.4            11.3
 3    61    63.4      -2.4             5.6
 4    62    63.4      -1.4             1.8
 5    62    63.4      -1.4             1.8
 6    63    63.4      -0.4             0.1
 7    63    63.4      -0.4             0.1
 8    63    63.4      -0.4             0.1
 9    64    63.4       0.6             0.4
10    64    63.4       0.6             0.4
11    65    63.4       1.6             2.7
12    66    63.4       2.6             7.0
13    67    63.4       3.6            13.3
14    68    63.4       4.6            21.6
      Mean 63.4      Sum 0.0        Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance s^2:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
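As one example of letting software do the work, the sketch below repeats the height calculation step by step in Python and then compares it with the standard library's `variance` and `stdev`. Variable names are illustrative.

```python
from statistics import mean, variance, stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

x_bar = mean(heights)
ss = sum((x - x_bar) ** 2 for x in heights)   # sum of squared deviations from the mean
s2 = ss / (len(heights) - 1)                  # variance: divide by n - 1 = 13
s = s2 ** 0.5                                 # standard deviation: square root of the variance

print(round(x_bar, 1), round(ss, 1), round(s2, 2), round(s, 2))
print(round(variance(heights), 2), round(stdev(heights), 2))   # same results from the library
```

`statistics.variance` and `statistics.stdev` use the n - 1 divisor; `pvariance` and `pstdev` are the versions that divide by the population size.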

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: Stocks, Mutual Funds, etc.

5. Variance: s² is the sample variance, σ² is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0; when does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

$m$: population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Bell-shaped curve: 68% of the area between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Bell-shaped curve: 95% of the area between y-2s and y+2s, 47.5% on each side of the mean y]

Example: textbook costs

$\bar{y}$ = 375.48, s = 42.72, n = 50

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s)$ = (332.76, 418.20)

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s)$ = (290.04, 460.92)

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s)$ = (247.32, 503.64)

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
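The three interval counts for the textbook data can be checked with a short Python sketch. The mean 375.48 and standard deviation 42.72 are taken from the slide rather than recomputed, and the helper name `count_within` is an assumption for illustration.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72   # summary values as given on the slide

def count_within(k):
    """Number of costs inside (y_bar - k*s, y_bar + k*s)."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(lo < c < hi for c in costs)

for k in (1, 2, 3):
    inside = count_within(k)
    print(f"within {k} sd: {inside}/{len(costs)} = {100 * inside / len(costs):.0f}%")
```

The observed percentages (64%, 96%, 100%) are close to the 68%, 95% and 99.7% the rule predicts for bell-shaped data.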

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

A z-score measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to $y$:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.
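The exam comparison can be reproduced with a few lines of code. This is a minimal sketch; the function name `z_score` is an assumption made for illustration, not part of the original slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

z_exam1 = z_score(91, mean=88, sd=6)    # exam 1 score
z_exam2 = z_score(92, mean=88, sd=10)   # exam 2 score

# The larger z-score is the better score relative to its own exam
better = "exam 1" if z_exam1 > z_exam2 else "exam 2"
print(z_exam1, z_exam2, better)
```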

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value $y_i$ indicates how many standard deviations $y_i$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: $z = (680 - 500)/100 = 1.8$

Gerald's z-score: $z = (27 - 18)/6 = 1.5$

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697
Sum of (y - ybar) = 0; sum of z-scores = 0
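The table can be recomputed directly from the support figures. This is a minimal sketch using Python's standard `statistics` module; the dictionary name `support` is an assumption for illustration.

```python
import statistics

# Support in $ millions, from the table above
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
mean = statistics.mean(values)
s = statistics.stdev(values)     # sample standard deviation (divides by n - 1)

z = {school: (v - mean) / s for school, v in support.items()}
total_z = sum(z.values())        # zero up to floating-point rounding

for school, score in z.items():
    print(f"{school:<11}{score:6.2f}")
print(f"sum of z-scores: {total_z:.10f}")
```

The deviations from the mean always cancel, so dividing each of them by the same $s$ leaves a sum of zero as well.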

Recently the mean tuition at 4-yr public colleges/universities in the US was $6,185 with a standard deviation of $1,804. In NC the mean tuition was $4,320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data, observations 1 to 25: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
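The quartile rules above can be sketched in code. The function name `quartiles` is hypothetical, and software packages often use other quartile conventions, so results from Excel or a calculator may differ slightly from this rule.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        lower, upper = xs[:half + 1], xs[half:]
    else:
        lower, upper = xs[:half], xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```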

Pulse Rates, n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaves
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaves
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4; Q1 = first quartile = 2.3; Q3 = third quartile = 4.2

Largest = max = 6.1; Smallest = min = 0.6

Data, observations 25 down to 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 7; the five-number summary min, Q1, m, Q3, max is marked on the plot]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q1 = first quartile = 2.3; Q3 = third quartile = 4.2; Largest = max = 7.9

Data, observations 25 down to 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with 40.5, 45, 63, 70, 78 and 100.5 marked]
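The 1.5 × IQR rule for flagging outliers can be sketched as follows. The function name `outlier_fences` is an assumption for illustration, and only the five-number summary of the pulse data is checked here, not all 138 values.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = outlier_fences(63, 78)

# Five-number summary of the pulse data: min, Q1, median, Q3, max
summary = [45, 63, 70, 78, 111]
outliers = [p for p in summary if p < low or p > high]
print(low, high, outliers)
```

Any value beyond a fence is drawn as a separate point on the boxplot; the whiskers stop at the most extreme data values that are still inside the fences.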

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive 710/2201 = 32.3%
Dead 1491/2201 = 67.7%

marg. dist. of class:
Crew 885/2201 = 40.2%
First 325/2201 = 14.8%
Second 285/2201 = 12.9%
Third 706/2201 = 32.1%
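The marginal distributions can be computed from the cell counts alone. This is a minimal sketch, assuming the table is stored as a nested dictionary; the variable names are illustrative.

```python
# Cell counts from the Titanic table: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total
class_marginal = {
    cls: sum(row[cls] for row in counts.values()) / grand_total
    for cls in counts["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{share:.1%}")
```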

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                    Crew    First   Second   Third   Total
Alive  Count         212      202      118     178     710
       % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead   Count         673      123      167     528    1491
       % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total  Count         885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
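The three questions use the same count, 202 first-class survivors, but divide by three different totals. A minimal sketch, with variable names chosen here for illustration:

```python
alive_first = 202    # first-class passengers who survived
alive_total = 710    # all survivors
first_total = 325    # all first-class passengers
grand_total = 2201   # everyone on board

# Conditional on survival: of the survivors, what share were in first class?
frac_survivors_in_first = alive_first / alive_total

# Joint: of everyone on board, what share were first class AND survived?
frac_first_and_survived = alive_first / grand_total

# Conditional on class: of the first-class passengers, what share survived?
frac_first_who_survived = alive_first / first_total

print(f"{frac_survivors_in_first:.3f} "
      f"{frac_first_and_survived:.3f} "
      f"{frac_first_who_survived:.3f}")
```

Choosing the denominator is the whole question: the wording "of survivors", "of passengers" and "of the first class passengers" names the group to divide by.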

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

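Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$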
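The value of r for the car data can be checked by applying the formula term by term. This is a minimal sketch using the ten (weight, fuel consumption) pairs from the scatterplot slide; the function name `correlation` is an assumption for illustration.

```python
import statistics

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
points = [
    (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
    (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9),
]

def correlation(pairs):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    xs, ys = zip(*pairs)
    n = len(pairs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample sds
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs)
    return sum(products) / (n - 1)

r = correlation(points)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so points that are above average (or below average) on both variables push r toward +1.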

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest


Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points against fouls; horizontal axis Fouls, 0 to 30; vertical axis Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Frequency Histograms

[Figure: frequency histogram "BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION"; classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis 0 to 70]

Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; horizontal axis Grade, 40 to 100; vertical axis Relative frequency, 0 to .30]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two frequency histograms with different centers; classes 0<2 through 16<18; vertical axis 0 to 70]

Histograms - Same Center, Different Spread

[Figure: two frequency histograms with the same center and different spread; classes 0<2 through 16<18; vertical axis 0 to 70]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

[Figure: symmetric distribution]

[Figure: complex, multimodal distribution]

Not all distributions have a simple overall shape, especially when there are few observations.

[Figure: skewed distribution]

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont.): outliers

All 200 m Races 20.2 secs or less

[Figure: histogram "200 m Races 20.2 secs or less (approx 700)"; horizontal axis TIMES, 19.2 to 20.16 in steps of 0.06; vertical axis Frequency, 0 to 60; labeled outliers: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with two outlying states labeled Alaska and Florida]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis Bin; vertical axis Frequency, 0 to 1000]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
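The frequency and relative frequency tables can be rebuilt from the raw grades. This is a minimal sketch, assuming each class includes its lower limit and excludes its upper limit ("40 up to 50" means 40 ≤ grade < 50).

```python
grades = [
    75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
    58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
    82, 48, 67, 55,
]

edges = list(range(40, 101, 10))   # class limits 40, 50, ..., 100

# Count the grades falling in each class [low, high)
freq = {}
for low, high in zip(edges, edges[1:]):
    freq[(low, high)] = sum(low <= g < high for g in grades)

# Relative frequency = class count divided by the number of grades
rel_freq = {cls: count / len(grades) for cls, count in freq.items()}

for (low, high), count in freq.items():
    print(f"{low} up to {high:<4}{count:3d}  {rel_freq[(low, high)]:.3f}")
```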

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis Grade, 40 to 100; vertical axis Relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50
2. 5
3. 17
4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
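Building the display by splitting each age into a tens digit and a ones digit can be sketched as below. The function name `stem_and_leaf` is hypothetical and the layout is only one reasonable way to print the rows.

```python
from collections import defaultdict

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

def stem_and_leaf(values):
    """Map each stem (10's digit) to its sorted list of leaves (1's digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

display = stem_and_leaf(ages)

# Print every stem from the smallest to the largest, including empty ones,
# so that gaps in the data stay visible
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing the empty stems matters when an unusual value is added: a single age of 95 would appear after blank rows for stems 7 and 8.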

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5

Number of TD passes by NFL teams: 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9.7

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9.7 = 112.7$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112.7}{7} = 16.1$$
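The same calculation in code, as a minimal sketch: the sum-then-divide steps mirror the formula, and the variable names are chosen here for illustration.

```python
# Weekly TV viewing time in hours for the 7 fourth graders
hours = [19, 40, 16, 12, 10, 6, 9.7]

n = len(hours)
total = sum(hours)    # sigma of y_i for i = 1..n
y_bar = total / n     # sample mean

print(f"sum = {total:.1f}, n = {n}, mean = {y_bar:.1f}")
```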

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram "Absences from Work"; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, n odd
Median = mean of 2 middle values, n even

Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n = 62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
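The odd/even rule for the median can be written out directly. This is a minimal sketch; the hand-written `median` function is for illustration only, since Python's `statistics.median` implements the same rule.

```python
def median(values):
    """Middle value when n is odd; mean of the 2 middle values when n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70,
    70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77,
    78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94,
    94, 95, 96, 96, 96, 98, 98, 103,
]

print(median([2, 4, 6, 8, 10]), median([2, 4, 6, 8]), median(pulses))
```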

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
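The effect of a single extreme observation can be seen with the class pulse rates. The following is a minimal sketch that recomputes the mean and median with and without the pulse of 140; treating 140 as the extreme value is an assumption made for illustration.

```python
import statistics

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

mean_all = statistics.mean(pulses)
median_all = statistics.median(pulses)

# Drop the largest pulse (140) and recompute both measures of center
trimmed = sorted(pulses)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with 140:    mean = {mean_all:.2f}, median = {median_all}")
print(f"without 140: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

Removing one value moves the mean by about 2.5 beats but the median by only half a beat, which is why the median is preferred for skewed data such as salaries.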

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: time plot "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis Year, 1985 to 2013; left vertical axis Mean, Median Salary, 200,000 to 3,700,000; right vertical axis Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

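Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$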

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    x_i    xbar    (x_i - xbar)    (x_i - xbar)^2
 1    59     63.4       -4.4             19.0
 2    60     63.4       -3.4             11.3
 3    61     63.4       -2.4              5.6
 4    62     63.4       -1.4              1.8
 5    62     63.4       -1.4              1.8
 6    63     63.4       -0.4              0.1
 7    63     63.4       -0.4              0.1
 8    63     63.4       -0.4              0.1
 9    64     63.4        0.6              0.4
10    64     63.4        0.6              0.4
11    65     63.4        1.6              2.7
12    66     63.4        2.6              7.0
13    67     63.4        3.6             13.3
14    68     63.4        4.6             21.6
      Mean 63.4       Sum 0.0          Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: histogram of the heights with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
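As one software option, the two-step calculation can be sketched in Python. The step-by-step version below mirrors the table; `statistics.stdev` from the standard library gives the same result in one call.

```python
import math
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1)
sum_sq_dev = sum((x - mean) ** 2 for x in heights)
variance = sum_sq_dev / (n - 1)

# Step 2: standard deviation = square root of the variance
s = math.sqrt(variance)

print(f"mean = {mean:.1f}, sum of squares = {sum_sq_dev:.1f}")
print(f"variance = {variance:.2f} sq in, s = {s:.2f} in")
print(f"library check: {statistics.stdev(heights):.2f}")
```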

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

The 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical).

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

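68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-s and y+s (34% on each side of the mean y)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-2s and y+2s (47.5% on each side of the mean y)]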

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:
$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:
$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:
$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
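The interval counts in the textbook-cost example can be reproduced with a short script. This is only a sketch: it takes the mean and standard deviation quoted on the slide as given rather than recomputing them, and the helper name `count_within` is made up for illustration.

```python
# Textbook costs from the example (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

# Summary values as quoted in the example (taken as given).
y_bar, s = 375.48, 42.72


def count_within(data, center, sd, k):
    """Count the values strictly inside (center - k*sd, center + k*sd)."""
    low, high = center - k * sd, center + k * sd
    return sum(low < value < high for value in data)


for k in (1, 2, 3):
    count = count_within(costs, y_bar, s, k)
    print(f"within {k} sd: {count}/{len(costs)} = {count / len(costs):.0%}")
```

The observed shares (64%, 96%, 100%) sit close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.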

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better
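The exam and SAT/ACT comparisons both come down to the same one-line calculation. A minimal sketch follows; the function name `z_score` is an illustrative choice, not something defined in the slides.

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` lies above (+) or below (-) `mean`."""
    return (value - mean) / sd


# Exam comparison: same mean, different spreads.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT: different scales become comparable once standardized.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(exam1, exam2)     # exam 1 has the larger z-score
print(eleanor, gerald)  # Eleanor has the larger z-score
```

Standardizing puts scores from different scales into the same unit (standard deviations from the mean), which is what makes the comparison meaningful.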

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
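The zero-sum property in the table can be checked numerically. The sketch below recomputes the mean, the sample standard deviation, and the z-scores from the support figures; because it works from the rounded values shown in the table, individual z-scores may differ from the slide in the last digit.

```python
import statistics

# Support figures ($ millions) from the table.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# Standardize each school's value.
z = {school: (y - y_bar) / s for school, y in support.items()}

print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of z-scores = {sum(z.values()):.10f}")
```

The deviations from the mean always cancel, so dividing each of them by the same s leaves a sum of zero (up to floating-point rounding).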

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4
Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (n = 25), listed as position: value
1: 0.6, 2: 1.2, 3: 1.6, 4: 1.9, 5: 1.5, 6: 2.1, 7: 2.3 (Q1), 8: 2.3, 9: 2.5, 10: 2.8, 11: 2.9, 12: 3.3, 13: 3.4 (median), 14: 3.6, 15: 3.7, 16: 3.8, 17: 3.9, 18: 4.1, 19: 4.2 (Q3), 20: 4.5, 21: 4.7, 22: 4.9, 23: 5.3, 24: 5.6, 25: 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
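The quartile rule above translates directly into code. The following is a minimal sketch of that specific rule (overall median included in both halves when n is odd); the function names are illustrative. Library routines such as percentile functions often use other conventions and can return slightly different quartiles.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the halves rule described above."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = values[: half + 1], values[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten-value example this gives Q1 = 6, median = 11, and Q3 = 16, matching the hand calculation.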

Pulse Rates, n = 138

count  stem | leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (n = 25), listed as position: value, largest first
25: 6.1, 24: 5.6, 23: 5.3, 22: 4.9, 21: 4.7, 20: 4.5, 19: 4.2 (Q3), 18: 4.1, 17: 3.9, 16: 3.8, 15: 3.7, 14: 3.6, 13: 3.4 (median), 12: 3.3, 11: 2.9, 10: 2.8, 9: 2.5, 8: 2.3, 7: 2.3 (Q1), 6: 2.1, 5: 1.5, 4: 1.9, 3: 1.6, 2: 1.2, 1: 0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot of the Disease X data; vertical axis: Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (n = 25), listed as position: value, largest first
25: 7.9, 24: 6.1, 23: 5.3, 22: 4.9, 21: 4.7, 20: 4.5, 19: 4.2 (Q3), 18: 4.1, 17: 3.9, 16: 3.8, 15: 3.7, 14: 3.6, 13: 3.4, 12: 3.3, 11: 2.9, 10: 2.8, 9: 2.5, 8: 2.3, 7: 2.3 (Q1), 6: 2.1, 5: 1.5, 4: 1.9, 3: 1.6, 2: 1.2, 1: 0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis: Years until death (0 to 8); the value 7.9 is plotted separately as an outlier]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with labels 40.5, 45, 63, 70, 78, 100.5]
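The 1.5 × IQR calculation used for both the Disease X and pulse-rate boxplots can be sketched as two small helpers. The function names are hypothetical, and the short pulse list passed to `find_outliers` is a hand-picked subset for illustration, not the full data set.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3):
    """Values falling outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [value for value in data if value < low or value > high]


# Pulse rates: Q1 = 63, Q3 = 78.
low, high = outlier_fences(63, 78)
print(low, high)

# A few pulse values chosen for illustration (not the whole data set).
print(find_outliers([45, 70, 100, 102, 111], 63, 78))
```

Values beyond a fence are drawn as individual points, and each whisker stops at the most extreme data value that is still inside its fence.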

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
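The marginal distributions are row totals and column totals divided by the grand total. A minimal sketch using the Titanic counts from the table follows; the dictionary layout and variable names are illustrative choices.

```python
# Counts from the Titanic contingency table: counts[survival][class].
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total.
class_marginal = {
    cls: sum(row[cls] for row in counts.values()) / grand_total
    for cls in counts["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```

Each marginal distribution adds up to 100% because it accounts for every one of the 2201 people exactly once.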

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival                  Crew    First   Second   Third   Total
Alive    Count             212      202      118     178     710
         % of col         24.0%    62.2%    41.4%   25.2%   32.3%
Dead     Count             673      123      167     528    1491
         % of col         76.0%    37.8%    58.6%   74.8%   67.7%
Total    Count             885      325      285     706    2201
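The "% of col" rows are conditional distributions: each count is divided by its own column total rather than by the grand total. A short sketch with the same Titanic counts (the helper name `survival_given_class` is made up for illustration):

```python
# Titanic counts by class.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}


def survival_given_class(cls):
    """Share of people in class `cls` who survived (conditioning on class)."""
    column_total = alive[cls] + dead[cls]
    return alive[cls] / column_total


for cls in alive:
    print(f"{cls}: {survival_given_class(cls):.1%} survived")
```

Comparing these column percentages across classes is what shows whether survival depended on class; the raw counts alone can mislead because the classes differ in size.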

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                          Class
Survival                  Crew    First   Second   Third   Total
Alive    Count             212      202      118     178     710
         % of col         24.0%    62.2%    41.4%   25.2%   32.3%
Dead     Count             673      123      167     528    1491
         % of col         76.0%    37.8%    58.6%   74.8%   67.7%
Total    Count             885      325      285     706    2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80

2. 235

3. 582

4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418

2. 388

3. 512

4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452

2. 488

3. 268

4. 277

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

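Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$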
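To see the formula in action, the sketch below applies it term by term to the ten (weight, fuel consumption) pairs from the scatterplot slide. The function name `correlation` is an illustrative choice; the calculation follows the formula above, using the sample standard deviations of x and y.

```python
import statistics

# (car weight in 1000 lbs, fuel consumption in gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = sum of products of the x and y z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is positive when a car is above (or below) average on both variables at once, which is why heavier cars with higher fuel consumption push r toward +1.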

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Relative Frequency Histogram of Exam Grades

[Figure: relative frequency histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Histograms

A histogram shows three general types of information

It provides visual indication of where the approximate center of the data is

We can gain an understanding of the degree of spread or variation in the data

We can observe the shape of the distribution

Histograms Showing Different Centers

[Figure: two histograms on the same scale (classes 0<2, 2<4, ..., 16<18; frequency 0 to 70) with different centers]

Histograms - Same Center, Different Spread

[Figure: two histograms on the same scale (classes 0<2, 2<4, ..., 16<18; frequency 0 to 70) with the same center but different spreads]

Histograms: Shape

Symmetric distribution

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed

Shape (cont): outliers
All 200 m Races 20.2 secs or less

[Figure: histogram, 200 m Races 20.2 secs or less (approx. 700); x-axis: TIMES (19.2 to 20.16); y-axis: Frequency (0 to 60); labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled as outliers]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of salaries; x-axis: Bin; y-axis: Frequency (0 to 1000)]

Statcrunch Example: 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
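The frequency and relative frequency tables can be rebuilt from the 30 raw grades. A minimal sketch follows, assuming each class includes its lower limit and excludes its upper limit ("40 up to 50" means 40 <= grade < 50); the variable names are illustrative.

```python
# The 30 exam grades from the example.
grades = [
    75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
    58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
    82, 48, 67, 55,
]

class_width = 10
lower_limits = range(40, 100, class_width)

# Frequency: how many grades fall in each class [lower, lower + 10).
frequency = {
    low: sum(low <= g < low + class_width for g in grades)
    for low in lower_limits
}

# Relative frequency: class count divided by the number of observations.
relative = {low: count / len(grades) for low, count in frequency.items()}

for low in lower_limits:
    print(f"{low} up to {low + class_width}: "
          f"{frequency[low]:2d}  {relative[low]:.3f}")
```

The relative frequencies are what the histogram bars plot, which is why they add up to 1.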

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50

2. 5

3. 17

4. 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5
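A stem-and-leaf display for two-digit data can be generated in a few lines. This sketch uses the employee ages from the example; the function name `stem_and_leaf` is hypothetical, and it assumes whole-number data with the 10's digit as the stem.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return display rows, one per stem, including empty stems in between."""
    leaves = defaultdict(list)
    for value in sorted(data):
        leaves[value // 10].append(value % 10)  # stem = 10's digit, leaf = 1's digit
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

print("\n".join(stem_and_leaf(ages)))
print()
print("\n".join(stem_and_leaf(ages + [95])))  # after the 95-year-old is hired
```

Keeping the empty stems 7 and 8 in the second display is deliberate: the gap is what makes the 95-year-old stand out as an outlier.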

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

count  stem | leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots

plot observations in time order: time on horizontal axis, variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2
Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability (next section)

measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$

A common measure of center is the sample mean

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

Denoted by the Greek letter $\mu$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Figure: histogram, Absences from Work; x-axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; y-axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)

Median = mean of 2 middle values (n even)

Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
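Both measures of center are available in Python's standard `statistics` module. The sketch below checks the sample mean of the TV-viewing data against the hand calculation and then reruns the two median examples (odd and even n).

```python
import statistics

# Weekly TV viewing time (hours) of 7 randomly selected 4th graders.
tv_hours = [19, 40, 16, 12, 10, 6, 9]

# Sample mean: sum of the observations divided by the sample size.
mean_hours = sum(tv_hours) / len(tv_hours)

# Median: statistics.median sorts the data and handles odd and even n.
median_odd = statistics.median([2, 4, 6, 8, 10])  # middle value
median_even = statistics.median([2, 4, 6, 8])     # mean of the 2 middle values

print(mean_hours, median_odd, median_even)
print(statistics.median(tv_hours))  # median of the TV data, for comparison
```

For the TV data the median (12) is below the mean (16) because the single large value of 40 hours pulls the mean upward but does not move the middle observation.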

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7

175 28 32 139 141 253 458

Example: n = 7 (ordered)

28 32 139 141 175 253 458

m = 141

Example: n = 8

175 28 32 139 141 253 357 458

Example: n = 8 (ordered)

28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th obs.; m = 85

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: time plot, Baseball Salaries: Mean, Median and Maximum 1985-2014; x-axis: Year (1985 to 2013); left y-axis: Mean, Median Salary (200,000 to 3,700,000); right y-axis: Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; x-axis: Salary ($1000s); y-axis: Frequency (0 to 600)]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis: Exam Scores (20 to 100); y-axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

Section 3.3
Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

deviation of $y_i$ from the mean: $y_i - \bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

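Example

Data set 1: $x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$

sum of deviations from mean:
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$

sum of deviations from mean:
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$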

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x


$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
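As one example of "other software", the sketch below reproduces the women's height calculation with Python's standard `statistics` module. The choice of Python is an assumption made for illustration; the slides themselves name only calculators, Excel, and Statcrunch.

```python
from statistics import mean, stdev, variance

# Women's heights (inches) from the table above
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

y_bar = mean(heights)    # sample mean
s2 = variance(heights)   # sample variance, divides by n - 1
s = stdev(heights)       # sample standard deviation

print(f"mean = {y_bar:.1f}, variance = {s2:.2f}, sd = {s:.2f}")
```

Both `variance` and `stdev` divide by n − 1, so they match the sample formulas on the slides.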

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ = population standard deviation

value of $\sigma$ typically not known; use $s$ to estimate the value of $\sigma$
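The difference between the two divisors can be seen on a tiny data set. This is a minimal sketch using Python's standard library, assuming the same numbers are treated once as a whole population (divide by N) and once as a sample (divide by n − 1); the data set {0, 100} is reused from the earlier example.

```python
from statistics import pstdev, stdev

data = [0, 100]

sigma = pstdev(data)  # population formula: divides by N
s = stdev(data)       # sample formula: divides by n - 1

print(f"population sd = {sigma:.2f}, sample sd = {s:.2f}")
```

For large n the two values are nearly identical; the difference matters mainly for small samples.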

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ = sample variance, $\sigma^2$ = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0

When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample standard deviation

POPULATION

$\mu$   population mean
$m$   population median
$\sigma^2$   population variance
$\sigma$   population standard deviation

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all of the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

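68-95-99.7 rule: 68% within 1 standard deviation of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 standard deviations of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]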

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
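The three interval counts for the textbook costs can be reproduced with a short script. This is a sketch in Python, not the tool used in the slides; the helper name `count_within` is hypothetical.

```python
from statistics import mean, stdev

# Textbook costs (n = 50) from the example
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

def count_within(data, k):
    """Count the values strictly inside (mean - k*s, mean + k*s)."""
    y_bar = mean(data)
    s = stdev(data)
    low, high = y_bar - k * s, y_bar + k * s
    return sum(low < value < high for value in data)

for k in (1, 2, 3):
    inside = count_within(costs, k)
    print(f"within {k} sd: {inside}/{len(costs)} = {100 * inside / len(costs):.0f}%")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, 68%, 95% and 99.7%, which is what the rule predicts for data that are only approximately bell-shaped.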

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 − 500)/100 = 1.8

Gerald's z-score: z = (27 − 18)/6 = 1.5

Eleanor's score is better
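The exam and SAT/ACT comparisons use the same one-line calculation. Below is a minimal Python sketch; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Exam comparison
exam_1 = z_score(91, 88, 6)
exam_2 = z_score(92, 88, 10)

# SAT vs ACT comparison
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(f"exam 1: {exam_1}, exam 2: {exam_2}")
print(f"Eleanor: {eleanor}, Gerald: {gerald}")
```

Because z-scores are unit-free, scores from different scales (SAT points and ACT points) can be compared directly.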

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland        15.5        6.4        1.79
UVA             13.1        4.0        1.12
Louisville      10.9        1.8        0.50
UNC              9.2        0.1        0.03
VaTech           7.9       -1.2       -0.34
FSU              7.9       -1.2       -0.34
GaTech           7.1       -2.0       -0.56
NCSU             6.5       -2.6       -0.73
Clemson          3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
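That the z-scores add to zero follows from the deviations adding to zero. The sketch below checks this on the support figures from the table; the variable names are illustrative, and the values are the rounded figures shown on the slide, so an individual z-score may differ from the table in the last digit.

```python
from statistics import mean, stdev

# Support in $ millions, as listed in the table
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

y_bar = mean(support.values())
s = stdev(support.values())

# Standardize every school's value
z_scores = {school: (y - y_bar) / s for school, y in support.items()}

for school, z in z_scores.items():
    print(f"{school:<12}{z:6.2f}")
print(f"mean = {y_bar:.4f}, s = {s:.4f}, sum of z = {sum(z_scores.values()):.2f}")
```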

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

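m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data, as (individual, count within quarter, value):

(1, 1, 0.6) (2, 2, 1.2) (3, 3, 1.6) (4, 4, 1.9) (5, 5, 1.5)
(6, 6, 2.1) (7, 7, 2.3) (8, 6, 2.3) (9, 5, 2.5) (10, 4, 2.8)
(11, 3, 2.9) (12, 2, 3.3) (13, 1, 3.4) (14, 2, 3.6) (15, 3, 3.7)
(16, 4, 3.8) (17, 5, 3.9) (18, 6, 4.1) (19, 7, 4.2) (20, 6, 4.5)
(21, 5, 4.7) (22, 4, 4.9) (23, 3, 5.3) (24, 2, 5.6) (25, 1, 6.1)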

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
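The quartile rules above translate directly into code. This is a sketch in Python of the rule as stated on the slide (median of each half, with the overall median included in both halves when n is odd); the function name `quartiles` is illustrative. Statistical packages often use other interpolation conventions, so their Q1 and Q3 can differ slightly from this rule.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd, the overall median is included in both halves;
    when n is even, the data split cleanly into two halves.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        lower = values[:half + 1]   # includes the middle value
        upper = values[half:]       # includes the middle value
    else:
        lower = values[:half]
        upper = values[half:]
    return median(lower), median(values), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}, IQR = {q3 - q1}")
```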

Pulse Rates, n = 138

Count   Stem | Leaves
   3      4  | 588
   9      5  | 001233444
  10      5  | 5556788899
  23      6  | 00011111122233333344444
  23      6  | 55556666667777788888888
  16      7  | 00000112222334444
  23      7  | 55555666666777888888999
  10      8  | 0000112224
  10      8  | 5555667789
   4      9  | 0012
   2      9  | 58
   4     10  | 0223
         10  |
   1     11  | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Count   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 − Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 − 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Count   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

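5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45   63   70   78   111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data, largest to smallest, as (individual, count within quarter, value):

(25, 1, 6.1) (24, 2, 5.6) (23, 3, 5.3) (22, 4, 4.9) (21, 5, 4.7)
(20, 6, 4.5) (19, 7, 4.2) (18, 6, 4.1) (17, 5, 3.9) (16, 4, 3.8)
(15, 3, 3.7) (14, 2, 3.6) (13, 1, 3.4) (12, 2, 3.3) (11, 3, 2.9)
(10, 4, 2.8) (9, 5, 2.5) (8, 6, 2.3) (7, 7, 2.3) (6, 6, 2.1)
(5, 5, 1.5) (4, 4, 1.9) (3, 3, 1.6) (2, 2, 1.2) (1, 1, 0.6)

Largest = max = 6.1

Smallest = min = 0.6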

[Figure: boxplot of the Disease X data; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data, largest to smallest, as (individual, count within quarter, value):

(25, 1, 7.9) (24, 2, 6.1) (23, 3, 5.3) (22, 4, 4.9) (21, 5, 4.7)
(20, 6, 4.5) (19, 7, 4.2) (18, 6, 4.1) (17, 5, 3.9) (16, 4, 3.8)
(15, 3, 3.7) (14, 2, 3.6) (13, 1, 3.4) (12, 2, 3.3) (11, 3, 2.9)
(10, 4, 2.8) (9, 5, 2.5) (8, 6, 2.3) (7, 7, 2.3) (6, 6, 2.1)
(5, 5, 1.5) (4, 4, 1.9) (3, 3, 1.6) (2, 2, 1.2) (1, 1, 0.6)

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 − Q1 = 4.2 − 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
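The 1.5 × IQR check can be written out as a short script. This is a sketch in Python that takes Q1 = 2.3 and Q3 = 4.2 from the slide rather than recomputing them; the function name `boxplot_fences` is hypothetical.

```python
def boxplot_fences(data, q1, q3):
    """Return the 1.5*IQR fences, the outliers, and the whisker ends."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    inside = [v for v in data if low_fence <= v <= high_fence]
    # Whiskers extend to the most extreme values that are not outliers
    return low_fence, high_fence, outliers, (min(inside), max(inside))

# Years until death for the 25 individuals (version with max = 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high, outliers, whiskers = boxplot_fences(years, q1=2.3, q3=4.2)
print(f"fences: ({low:.2f}, {high:.2f}), outliers: {outliers}, whiskers: {whiskers}")
```

The upper whisker stops at 6.1, the largest value below the upper fence of 7.05, and 7.9 is flagged as an outlier.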

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 − 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 − 1.5(IQR): 63 − 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew    First    Second    Third    Total
Alive      212      202       118      178      710
Dead       673      123       167      528     1491
Total      885      325       285      706     2201

Marginal distributions

marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
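The marginal distributions are the row totals and column totals divided by the grand total. A minimal sketch in Python follows, with the Titanic counts entered as a nested dictionary; this layout is an illustrative choice, not something prescribed by the slides.

```python
# Titanic counts: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

classes = ["Crew", "First", "Second", "Third"]
grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / grand_total
    for cls in classes
}

for label, share in {**survival_marginal, **class_marginal}.items():
    print(f"{label:<7}{100 * share:5.1f}%")
```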

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                                  Class
Survival             Crew    First    Second    Third    Total
Alive   Count         212      202       118      178      710
        % of col     24.0%    62.2%     41.4%    25.2%    32.3%
Dead    Count         673      123       167      528     1491
        % of col     76.0%    37.8%     58.6%    74.8%    67.7%
Total   Count         885      325       285      706     2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
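The three questions use the same count, 202, with three different denominators. The sketch below works through them in Python along with the conditional distribution of survival given class; the variable names are illustrative.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
class_totals = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

total_alive = sum(alive.values())          # 710 survivors
grand_total = sum(class_totals.values())   # 2201 people on board

# Conditional distribution: chance of survival given class
survival_given_class = {cls: alive[cls] / class_totals[cls] for cls in alive}

# Same numerator, three different questions
first_given_alive = alive["First"] / total_alive            # among survivors
first_and_alive = alive["First"] / grand_total              # among everyone
alive_given_first = alive["First"] / class_totals["First"]  # among first class

for cls, share in survival_given_class.items():
    print(f"P(alive | {cls}) = {100 * share:.1f}%")
print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
```

Choosing the denominator is the whole question: "of survivors" conditions on the row, "of first class passengers" conditions on the column, and "of passengers" uses the grand total.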

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student    Beers    Blood Alcohol
   1         5         0.10
   2         2         0.03
   3         9         0.19
   4         7         0.095
   5         3         0.07
   6         3         0.02
   7         4         0.07
   8         5         0.085
   9         8         0.12
  10         3         0.04
  11         5         0.06
  12         5         0.05
  13         6         0.10
  14         7         0.09
  15         1         0.01
  16         4         0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

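Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$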
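The formula for r is the average product of paired z-scores, with n − 1 in place of n. Below is a minimal Python sketch that applies it to the ten (weight, fuel consumption) pairs listed for the scatterplot; the function name `correlation` is illustrative and is written out by hand so that each term of the formula is visible.

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
cars = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]

def correlation(pairs):
    """r = 1/(n-1) * sum of z_x * z_y over all (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs]
    return sum(products) / (len(pairs) - 1)

r = correlation(cars)
print(f"r = {r:.4f}")
```

Because each variable is standardized first, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.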

Properties: r ranges from −1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

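Example: sum of deviations from mean

Data set 1: $x_1 = 49,\ x_2 = 51,\ \bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0,\ y_2 = 100,\ \bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$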
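The following sketch repeats the example in Python. The helper name `sum_of_deviations` is hypothetical; it shows that the deviations cancel for both data sets even though their spreads are very different.

```python
def sum_of_deviations(values):
    """Return the sum of (value - mean) over all values."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)

set1 = [49, 51]   # tightly clustered around 50
set2 = [0, 100]   # widely spread around 50

print(sum_of_deviations(set1), sum_of_deviations(set2))
```

Because the result is the same for both data sets, the deviations have to be squared (or otherwise made non-negative) before they can measure spread.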

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women height (inches)

 i    $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
 1    59      63.4        -4.4                19.0
 2    60      63.4        -3.4                11.3
 3    61      63.4        -2.4                 5.6
 4    62      63.4        -1.4                 1.8
 5    62      63.4        -1.4                 1.8
 6    63      63.4        -0.4                 0.1
 7    63      63.4        -0.4                 0.1
 8    63      63.4        -0.4                 0.1
 9    64      63.4         0.6                 0.4
10    64      63.4         0.6                 0.4
11    65      63.4         1.6                 2.7
12    66      63.4         2.6                 7.0
13    67      63.4         3.6                13.3
14    68      63.4         4.6                21.6

Mean 63.4; Sum of deviations 0.0; Sum of squared deviations 85.2

1. First calculate the variance $s^2$:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure annotation: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
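As one software option (an illustration, not the course's prescribed tool), Python's standard `statistics` module reproduces the women's height calculation. Both `variance` and `stdev` divide by n - 1, matching the sample formulas above.

```python
from statistics import mean, stdev, variance

# women's heights (inches) from the table above
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

y_bar = mean(heights)
s2 = variance(heights)  # sample variance: squared deviations / (n - 1)
s = stdev(heights)      # sample standard deviation: square root of s2

print(round(y_bar, 1), round(s2, 2), round(s, 2))
```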

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0 (when does $s = 0$? $\sigma = 0$?)

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE                            POPULATION
$\bar{y}$  sample mean            $\mu$       population mean
$m$        sample median                      population median
$s^2$      sample variance        $\sigma^2$  population variance
$s$        sample stand. dev.     $\sigma$    population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule: links the mean and standard deviation (numerical) to the histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

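68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$ (34% on each side of $\bar{y}$).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$ (47.5% on each side of $\bar{y}$).]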

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$; percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$; percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$; percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
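The interval counts for the textbook-cost data can be reproduced with a short sketch. The helper `share_within` is a hypothetical name; it returns the fraction of observations within k sample standard deviations of the mean.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = mean(costs), stdev(costs)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(lo <= c <= hi for c in costs) / len(costs)

for k in (1, 2, 3):
    print(k, share_within(k))
```

The observed shares can then be set beside the rule's 68%, 95%, and 99.7%.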

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to $y$:

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
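The exam and SAT/ACT comparisons use the same one-line calculation, sketched here in Python. The function name `z_score` is illustrative only.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam1 = z_score(91, 88, 6)        # exam 1: mean 88, sd 6
exam2 = z_score(92, 88, 10)       # exam 2: mean 88, sd 10
eleanor = z_score(680, 500, 100)  # SAT Math
gerald = z_score(27, 18, 6)       # ACT Math

print(exam1, exam2, eleanor, gerald)
```

The larger z-score in each pair marks the better score relative to its own distribution.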

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of z-scores = 0
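A brief sketch of why the z-score column sums to zero: the deviations from the mean sum to zero, and dividing each by the same s does not change that. The dictionary layout is an assumption made for illustration; the support figures are those in the table.

```python
from statistics import mean, stdev

# 2013 support ($ millions) for the 9 public ACC schools
support = {"Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
           "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
           "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8}

values = list(support.values())
y_bar, s = mean(values), stdev(values)

# standardize each school's value
z = {school: (y - y_bar) / s for school, y in support.items()}

print(round(y_bar, 4), round(s, 4), round(sum(z.values()), 10))
```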

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Individual   Position   Value
 1           1          0.6
 2           2          1.2
 3           3          1.6
 4           4          1.9
 5           5          1.5
 6           6          2.1
 7           7          2.3
 8           6          2.3
 9           5          2.5
10           4          2.8
11           3          2.9
12           2          3.3
13           1          3.4
14           2          3.6
15           3          3.7
16           4          3.8
17           5          3.9
18           6          4.1
19           7          4.2
20           6          4.5
21           5          4.7
22           4          4.9
23           3          5.3
24           2          5.6
25           1          6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
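The rules above can be sketched as a small Python function. The name `quartiles` is hypothetical. Other software may use a different quartile convention, so results can differ slightly from this median-of-halves rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```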

Pulse Rates, n = 138

Count   Stem   Leaves
         4
  3      4     588
  9      5     001233444
 10      5     5556788899
 23      6     00011111122233333344444
 23      6     55556666667777788888888
 16      7     00000112222334444
 23      7     55555666666777888888999
 10      8     0000112224
 10      8     5555667789
  4      9     0012
  2      9     58
  4     10     0223
        10
  1     11     1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

(depth)  stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

(depth)  stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X: years until death for 25 individuals, listed from largest to smallest:

6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Largest = max = 6.1; Q3 = third quartile = 4.2; m = median = 3.4; Q1 = first quartile = 2.3; Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X. Vertical axis: Years until death (0 to 7).]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary

Disease X: years until death for 25 individuals, listed from largest to smallest:

7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Largest = max = 7.9; Q3 = third quartile = 4.2; Q1 = first quartile = 2.3

[Figure: boxplot of Disease X. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, labeled at 40.5, 45, 63, 70, 78, and 100.5.]
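The 1.5(IQR) calculation can be sketched as a small helper. The name `outlier_fences` is hypothetical. Values below the lower fence or above the upper fence are flagged as outliers.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# beginning-of-class pulse rates: Q1 = 63, Q3 = 78
low, high = outlier_fences(63, 78)
print(low, high)

# Disease X example: Q1 = 2.3, Q3 = 4.2; is 7.9 years an outlier?
_, upper_x = outlier_fences(2.3, 4.2)
print(7.9 > upper_x)
```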

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers. Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178      710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second   Third   Total
Alive   Count        212     202     118      178      710
        % of col.    24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count        673     123     167      528     1491
        % of col.    76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count        885     325     285      706     2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                     Crew    First   Second   Third   Total
Alive   Count        212     202     118      178      710
        % of col.    24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count        673     123     167      528     1491
        % of col.    76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count        885     325     285      706     2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
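The marginal and conditional distributions can be computed directly from the Titanic counts. This is a minimal sketch; the nested-dictionary layout and variable names are assumptions made for illustration.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# marginal distribution of survival: row totals / grand total
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# column totals, then the conditional distribution of survival given class
class_totals = {cls: sum(counts[s][cls] for s in counts)
                for cls in counts["Alive"]}
alive_given_class = {cls: counts["Alive"][cls] / class_totals[cls]
                     for cls in class_totals}

print(total, survival_marginal, alive_given_class)
```

The three questions above differ only in the denominator: 710 (survivors), 2201 (everyone), or 325 (first class).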

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC vs number of beers for the 16 students in the table above.]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

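Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$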
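The formula for r can be applied to the ten (weight, fuel consumption) pairs from the scatterplot slide. This sketch assumes the decimal points stripped from the extracted data belong where the axis ranges suggest (for example, 3.4 and 5.5). The function name `correlation` is illustrative.

```python
from statistics import mean, stdev

# car weight (1000 lbs) and fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(x, y):
    """Sum of products of standardized x and y values, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = ((xi - x_bar) / s_x * (yi - y_bar) / s_y
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```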

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Histograms Showing Different Centers

[Figure: two histograms with different centers. Horizontal-axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18. Vertical axis: 0 to 70.]

Histograms - Same Center, Different Spread

[Figure: two histograms with the same center and different spread. Horizontal-axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18. Vertical axis: 0 to 70.]

Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont.): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram, 200 m Races 20.2 secs or less (approx. 700). Horizontal axis: TIMES (19.2 to 20.16). Vertical axis: Frequency (0 to 60). Labeled outliers: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

[Figure: histogram with the two outlying states labeled Alaska and Florida.]

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries. Horizontal axis: Bin. Vertical axis: Frequency (0 to 1000).]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
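The frequency and relative frequency tables can be tallied from the 30 exam grades with a short sketch. It assumes "40 up to 50" means 40 <= grade < 50; the dictionary names are illustrative.

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

# classes of width 10: 40 up to 50, 50 up to 60, ..., 90 up to 100
freq = {(lo, lo + 10): sum(lo <= g < lo + 10 for g in grades)
        for lo in range(40, 100, 10)}

# relative frequency = class frequency / number of observations
rel_freq = {cls: count / len(grades) for cls, count in freq.items()}

for cls, count in freq.items():
    print(cls, count, round(rel_freq[cls], 3))
```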

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram. Horizontal axis: Grade (40 to 100). Vertical axis: Relative frequency (0 to .30).]

Based on the histogram, about what percent of the values are between 475 and 525?

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem leaf

4 03

3 247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates, n = 138

Count   Stem   Leaves
         4
  3      4     588
  9      5     001233444
 10      5     5556788899
 23      6     00011111122233333344444
 23      6     55556666667777788888888
 16      7     00000112222334444
 23      7     55555666666777888888999
 10      8     0000112224
 10      8     5555667789
  4      9     0012
  2      9     58
  4     10     0223
        10
  1     11     1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

   1999-2000 | stem | 2012-13
           2 |  4   | 03
           6 |  3   | 7
           2 |  3   | 24
        6655 |  2   | 6677789
 43322221100 |  2   | 01222233444
  9998887666 |  1   | 67889
         421 |  1   | 134
             |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots: plot observations in time order; time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$

A common measure of center is the sample mean

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$

Population Mean

Denoted by the Greek letter $\mu$

$N$ is the population size (for example, $N$ = 34000 for NCSU)

$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$ (population mean)

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram, Absences from Work. Horizontal-axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More. Vertical axis: Frequency (0 to 70).]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd; mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458; m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458; m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1) 5245

2) 4965.5

3) 4960

4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1) 5245

2) 4965.5

3) 5546

4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location is the 12th observation; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
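To make this sensitivity concrete, the sketch below recomputes the mean and median of the class pulse rates from the earlier example with and without the largest value (140). It is an illustration using Python's standard `statistics` module; the variable names are assumptions, not part of the slides.

```python
import statistics

# Class pulse rates from the earlier example (n = 23).
pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
         90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single largest observation dropped.
without_largest = pulse[:-1]

mean_all = statistics.mean(pulse)
median_all = statistics.median(pulse)
mean_trimmed = statistics.mean(without_largest)
median_trimmed = statistics.median(without_largest)

# The mean moves by about 2.5 beats; the median moves by only 0.5.
print(round(mean_all, 2), median_all)
print(round(mean_trimmed, 2), median_trimmed)
```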

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: year (1985 to 2013); left vertical axis: mean and median salary ($200,000 to $3,700,000); right vertical axis: maximum salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 baseball salaries; horizontal axis: salary ($1000s); vertical axis: frequency (0 to 600); bar frequencies: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: histogram of exam scores; horizontal axis: exam scores (20 to 100); vertical axis: frequency (0 to 30)]

Symmetric data

mean, median approximately equal

[Figure: histogram of bank customers, 10:00-11:00 am; horizontal axis: number of customers; vertical axis: frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

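Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$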

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$: sample standard deviation

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

 i    x_i    x̄      (x_i - x̄)   (x_i - x̄)²
 1    59     63.4    -4.4        19.0
 2    60     63.4    -3.4        11.3
 3    61     63.4    -2.4         5.6
 4    62     63.4    -1.4         1.8
 5    62     63.4    -1.4         1.8
 6    63     63.4    -0.4         0.1
 7    63     63.4    -0.4         0.1
 8    63     63.4    -0.4         0.1
 9    64     63.4     0.6         0.4
10    64     63.4     0.6         0.4
11    65     63.4     1.6         2.7
12    66     63.4     2.6         7.0
13    67     63.4     3.6        13.3
14    68     63.4     4.6        21.6

Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2
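The table's calculation can be reproduced step by step. The sketch below is an illustration, not the slides' own procedure: it follows the definition (squared deviations divided by n - 1) and then compares the result with `statistics.stdev` from the Python standard library. Variable names are assumptions.

```python
import statistics

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [x - x_bar for x in heights]
sum_sq_dev = sum(d ** 2 for d in deviations)

s2 = sum_sq_dev / (n - 1)   # sample variance, in square inches
s = s2 ** 0.5               # sample standard deviation, in inches

print(round(x_bar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))
```

The library call `statistics.stdev(heights)` uses the same n - 1 divisor, so it should agree with `s` computed by hand here.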


$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$: population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

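68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$ (34% on each side of $\bar{y}$)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$ (47.5% on each side of $\bar{y}$)]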

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342 346 347 348 348 349 354 355 355 360 361 364 367 369 371 373 377 380 381 382 385 385 387 390 390 397 398 409 409 410 418 422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
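The three interval counts can be verified directly. The sketch below is illustrative: it takes the mean and standard deviation as stated on the slide (375.48 and 42.72) rather than recomputing them, and the helper name `count_within` is an assumption.

```python
# Textbook costs for the 50 students on the slide.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR = 375.48  # sample mean, as given on the slide
S = 42.72       # sample standard deviation, as given on the slide

def count_within(values, center, spread, k):
    """Count the values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low < v < high)

for k in (1, 2, 3):
    inside = count_within(costs, Y_BAR, S, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```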

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10

2) 15

3) 20

4) 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
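Standardizing is a one-line computation, which makes comparisons across different scales easy to automate. The sketch below is illustrative; the function name `z_score` is an assumption, and the numbers are the exam and SAT/ACT values from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam comparison: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT versus ACT: different scales, compared in standard-deviation units.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)
print(z_eleanor, z_gerald)
```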

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03

2) -1.03

3) 2.39

4) 1865

5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4, 1/4, 1/4, 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
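The steps above can be sketched as a small function. This is an illustration of the slide's rule only (overall median included in both halves when n is odd); calculators and statistical software often use other quartile conventions and may give slightly different values. The names `median`, `quartiles`, and `disease_years` are assumptions.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))

# The 25 Disease X values (years until death) from the earlier slide.
disease_years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
                 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
print(quartiles(disease_years))
```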

Pulse Rates, n = 138

       Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for Disease X; vertical axis: years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Years until death, individuals 1 to 25: 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
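The 1.5 × IQR rule used here can be sketched in a few lines. The quartiles are taken as given on the slide (Q1 = 2.3, Q3 = 4.2); the variable names and the symmetric lower fence are illustrative additions, since the slide only works out the upper fence.

```python
# Disease X data (years until death), including the 7.9-year individual.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

Q1, Q3 = 2.3, 4.2            # quartiles as given on the slide
iqr = Q3 - Q1                # interquartile range
upper_fence = Q3 + 1.5 * iqr
lower_fence = Q1 - 1.5 * iqr

# Points beyond a fence are plotted individually as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# Each whisker ends at the most extreme data value still inside its fence.
whisker_top = max(y for y in years if y <= upper_fence)
whisker_bottom = min(y for y in years if y >= lower_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```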

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of pass catching yards by receivers; horizontal axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival: Alive 710/2201 = 32.3%; Dead 1491/2201 = 67.7%

Marginal distribution of class: Crew 885/2201 = 40.2%; First 325/2201 = 14.8%; Second 285/2201 = 12.9%; Third 706/2201 = 32.1%
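Marginal distributions are row or column totals divided by the grand total. The following sketch recomputes them from the Titanic counts; the dictionary layout and variable names are assumptions made for illustration.

```python
# Titanic counts by survival status (rows) and class (columns).
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in titanic.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in titanic.items()
}

# Marginal distribution of class: column totals / grand total.
class_marginal = {
    cls: sum(titanic[status][cls] for status in titanic) / grand_total
    for cls in titanic["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```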

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic table of counts and column percentages above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
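The three questions differ only in the denominator: all survivors, all passengers, or all first class passengers. A short sketch under the same Titanic counts (the variable names are assumptions) makes the distinction explicit and also reproduces the "% of col" rows of the conditional table.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())
grand_total = total_alive + sum(dead.values())

# Conditional distribution of survival given class (the "% of col" row).
survival_given_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

# Same numerator (202), three different denominators.
first_given_alive = alive["First"] / total_alive                        # 202/710
first_and_alive = alive["First"] / grand_total                          # 202/2201
alive_given_first = alive["First"] / (alive["First"] + dead["First"])   # 202/325

print({cls: round(100 * p, 1) for cls, p in survival_given_class.items()})
print(round(first_given_alive, 3), round(first_and_alive, 3),
      round(alive_given_first, 3))
```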

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

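Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$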
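The formula for r is an average of products of standardized x and y values. The sketch below applies it to the ten (weight, fuel consumption) pairs from the scatterplot slide; the function name `correlation` is an assumption, and `statistics.stdev` supplies the sample standard deviations.

```python
import statistics

# (car weight in 1000 lbs, fuel consumption in gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```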

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest


Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points vs fouls; horizontal axis: fouls (0 to 30); vertical axis: points (0 to 80)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Histograms - Same Center, Different Spread

[Figure: two histograms with the same center and different spread; horizontal axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: 0 to 70]

Histograms: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Symmetric distribution

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed.

Shape (cont.): outliers. All 200 m Races 20.2 secs or less

[Figure: histogram of 200 m races of 20.2 secs or less (approx. 700); horizontal axis: times (19.2 to 20.16); vertical axis: frequency (0 to 60); labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries; horizontal axis: bin; vertical axis: frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61 58 70 77 80 58 94 78 62 79 83 54 52 45 82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits    Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total           30

Example-3: Relative Frequency Distribution of Grades

Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
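Both tables can be generated from the raw grades by counting how many values fall in each class. The sketch below is illustrative; it assumes each class includes its lower limit and excludes its upper limit ("40 up to 50" means 40 <= grade < 50), and the variable names are not from the slides.

```python
# The 30 exam grades from the example.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61, 58, 70,
          77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45, 82, 48, 67, 55]

# Class limits: lower limit included, upper limit excluded.
classes = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

frequency = {
    (low, high): sum(1 for g in grades if low <= g < high)
    for low, high in classes
}
relative_frequency = {c: count / len(grades) for c, count in frequency.items()}

for (low, high), count in frequency.items():
    print(f"{low} up to {high}: {count}  "
          f"{relative_frequency[(low, high)]:.3f}")
```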

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram of grades; horizontal axis: grade (40 to 100); vertical axis: relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1) 50

2) 5

3) 17

4) 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
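Building the display amounts to splitting each value into its 10's digit (stem) and 1's digit (leaf) and grouping the sorted leaves by stem. A minimal sketch for two-digit data follows; the function name `stem_and_leaf` is an assumption, and empty stems are printed so that gaps in the data remain visible.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group the 1's digits (leaves) of sorted values by their 10's digit (stem)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

# Employee ages from the example.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)

# Print every stem from the smallest to the largest, including empty ones.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```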

Suppose a 95 yr old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates, n = 138

       Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1) 4

2) 6

3) 8

4) 10

5) 12

Other Graphical Methods for Data

Time plots

plot observations in time order, time on horizontal axis, variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability (next section)

measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$

A common measure of center is the sample mean

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

Denoted by the Greek letter $\mu$ (population mean):

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; vertical axis: Frequency]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
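The odd/even rule for the median translates directly into code. The sketch below is illustrative only (the function name `median` is ours); Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Median of a list: middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # n odd: single middle value
        return ordered[mid]
    # n even: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6, 8, 10]))
print(median([2, 4, 6, 8]))
```

Sorting first matters: applying the position rule to unsorted data gives the wrong answer.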

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year; left vertical axis: Mean, Median Salary; right vertical axis: Maximum Salary]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries. Horizontal axis: Salary ($1000s); vertical axis: Frequency. Bar frequencies: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: Frequency]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest: OK sometimes; in general too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n}(y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n}(y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

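Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$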

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

i     x_i    x-bar    (x_i - x-bar)    (x_i - x-bar)^2
1     59     63.4     -4.4             19.0
2     60     63.4     -3.4             11.3
3     61     63.4     -2.4             5.6
4     62     63.4     -1.4             1.8
5     62     63.4     -1.4             1.8
6     63     63.4     -0.4             0.1
7     63     63.4     -0.4             0.1
8     63     63.4     -0.4             0.1
9     64     63.4     0.6              0.4
10    64     63.4     0.6              0.4
11    65     63.4     1.6              2.7
12    66     63.4     2.6              7.0
13    67     63.4     3.6              13.3
14    68     63.4     4.6              21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
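As one example of "other software", the women's height calculation can be reproduced with Python's standard `statistics` module. This is a sketch, not the course's prescribed tool; `statistics.variance` and `statistics.stdev` both divide by n - 1, matching the sample formulas above.

```python
import statistics

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

mean = statistics.mean(heights)          # sample mean
variance = statistics.variance(heights)  # sample variance s^2 (divides by n - 1)
sd = statistics.stdev(heights)           # sample standard deviation s

# Sum of squared deviations from the mean, as in the table.
sum_sq_dev = sum((x - mean) ** 2 for x in heights)

print(round(mean, 1), round(sum_sq_dev, 1), round(variance, 2), round(sd, 2))
```

For a population rather than a sample, the same module provides `statistics.pvariance` and `statistics.pstdev`, which divide by N.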

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$ (population standard deviation):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}}$$

$N$ is the population size (for example, $N$ = 34000 for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s^2 = sample variance; σ^2 = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0. When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

$m$: population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

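68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between y-bar - s and y-bar + s, 34% on each side of y-bar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between y-bar - 2s and y-bar + 2s, 47.5% on each side of y-bar]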

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
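The interval counts in the textbook cost example can be verified with a short script. In this sketch the mean and standard deviation are taken as given from the slide (375.48 and 42.72) rather than recomputed, and the helper name `count_within` is our own.

```python
# Textbook costs for the 50 students in the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

# Summary values as stated on the slide (assumed, not recomputed here).
Y_BAR = 375.48
S = 42.72


def count_within(values, center, spread, k):
    """Count the values lying within k standard deviations of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low < v < high)


for k in (1, 2, 3):
    inside = count_within(costs, Y_BAR, S, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, 68%, 95% and 99.7%, which is what "approximately" in the rule allows for.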

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better
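Both comparisons (exam 1 vs exam 2, SAT vs ACT) use the same standardization step, which a small function makes explicit. This is an illustrative sketch; the function name `z_score` is ours.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam example: same mean, different spreads.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales made comparable.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(exam1, exam2, eleanor, gerald)
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because a z-score is unit-free, scores from tests with different scales can be compared directly once they are standardized.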

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland      15.5       6.4         1.79
UVA           13.1       4.0         1.12
Louisville    10.9       1.8         0.50
UNC           9.2        0.1         0.03
VaTech        7.9        -1.2        -0.34
FSU           7.9        -1.2        -0.34
GaTech        7.1        -2.0        -0.56
NCSU          6.5        -2.6        -0.73
Clemson       3.8        -5.3        -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

1    1    0.6
2    2    1.2
3    3    1.6
4    4    1.9
5    5    1.5
6    6    2.1
7    7    2.3
8    6    2.3
9    5    2.5
10   4    2.8
11   3    2.9
12   2    3.3
13   1    3.4
14   2    3.6
15   3    3.7
16   4    3.8
17   5    3.9
18   6    4.1
19   7    4.2
20   6    4.5
21   5    4.7
22   4    4.9
23   3    5.3
24   2    5.6
25   1    6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1    M    Q3

1/4    1/4    1/4    1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
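The quartile rules above can be written as a short function. This sketch follows the slide's convention (include the overall median in both halves when n is odd); the names `median` and `quartiles` are ours, and statistical software often uses slightly different quartile conventions, so results may differ from library functions.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

With Q1 and Q3 in hand, the interquartile range is simply `q3 - q1`.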

Pulse Rates n = 138

     Stem | Leaves
        4 |
 3      4 | 588
 9      5 | 001233444
10      5 | 5556788899
23      6 | 00011111122233333344444
23      6 | 55556666667777788888888
16      7 | 00000112222334444
23      7 | 55555666666777888888999
10      8 | 0000112224
10      8 | 5555667789
 4      9 | 0012
 2      9 | 58
 4     10 | 0223
       10 |
 1     11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

     stem | leaf
 2     22 | 55
 4     23 | 57
 6     24 | 26
 7     25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
 7     31 | 45
 5     32 | 155
 2     33 | 6
 1     34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

     stem | leaf
 2     22 | 55
 4     23 | 57
 6     24 | 26
 7     25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
 7     31 | 45
 5     32 | 155
 2     33 | 6
 1     34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1    6.1
24   2    5.6
23   3    5.3
22   4    4.9
21   5    4.7
20   6    4.5
19   7    4.2
18   6    4.1
17   5    3.9
16   4    3.8
15   3    3.7
14   2    3.6
13   1    3.4
12   2    3.3
11   3    2.9
10   4    2.8
9    5    2.5
8    6    2.3
7    7    2.3
6    6    2.1
5    5    1.5
4    4    1.9
3    3    1.6
2    2    1.2
1    1    0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1    7.9
24   2    6.1
23   3    5.3
22   4    4.9
21   5    4.7
20   6    4.5
19   7    4.2
18   6    4.1
17   5    3.9
16   4    3.8
15   3    3.7
14   2    3.6
13   1    3.4
12   2    3.3
11   3    2.9
10   4    2.8
9    5    2.5
8    6    2.3
7    7    2.3
6    6    2.1
5    5    1.5
4    4    1.9
3    3    1.6
2    2    1.2
1    1    0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
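The 1.5 × IQR rule for flagging outliers and placing the whiskers can be sketched in Python. The quartiles are taken as given from the slide (Q1 = 2.3, Q3 = 4.2); the data values assume the decimal placement used in this example (e.g. 7.9 years), and the name `outlier_fences` is our own.

```python
# Years until death for the 25 individuals (modified data set with max 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

Q1, Q3 = 2.3, 4.2  # quartiles as stated on the slide


def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = outlier_fences(Q1, Q3)

# Observations beyond the fences are plotted individually as outliers.
outliers = [y for y in years if y < low or y > high]

# Whiskers extend to the most extreme observations still inside the fences.
upper_whisker = max(y for y in years if y <= high)
lower_whisker = min(y for y in years if y >= low)

print(round(high, 2), outliers, lower_whisker, upper_whisker)
```

Here only 7.9 exceeds the upper fence of 7.05, so the upper whisker stops at 6.1, the largest value inside the fence.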

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labels at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis from 0 to 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew    First   Second  Third   Total
Alive    212     202     118     178     710
Dead     673     123     167     528     1491
Total    885     325     285     706     2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
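The marginal distributions come from the row and column totals divided by the grand total. A minimal sketch of that calculation follows; the dictionary layout and variable names are our own choices.

```python
# Titanic counts: survival status by class.
table = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in table.items()
}

# Marginal distribution of class: column totals / grand total.
class_marginal = {
    cls: sum(table[status][cls] for status in table) / grand_total
    for cls in table["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```

Each marginal distribution sums to 100%, which is a quick sanity check on the table.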

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                      Class
Survival              Crew     First    Second   Third    Total
Alive    Count        212      202      118      178      710
         % of col     24.0%    62.2%    41.4%    25.2%    32.3%
Dead     Count        673      123      167      528      1491
         % of col     76.0%    37.8%    58.6%    74.8%    67.7%
Total    Count        885      325      285      706      2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
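The three questions share the same numerator (202 first-class survivors) and differ only in the denominator, which the sketch below makes explicit. The variable names are illustrative.

```python
# Titanic counts by class.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())                # 710: all survivors
everyone = survivors + sum(dead.values())      # 2201: all people on board
first_class = alive["First"] + dead["First"]   # 325: all first-class passengers

# Conditional on surviving: what fraction of survivors were in first class?
frac_survivors_in_first = alive["First"] / survivors

# Joint: what fraction of everyone was both first class and a survivor?
frac_first_and_survived = alive["First"] / everyone

# Conditional on class: what fraction of first-class passengers survived?
frac_first_who_survived = alive["First"] / first_class

print(f"{frac_survivors_in_first:.3f} "
      f"{frac_first_and_survived:.3f} "
      f"{frac_first_who_survived:.3f}")
```

Identifying the group named after "of" in the question tells you which total goes in the denominator.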

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student    Beers    Blood Alcohol
1          5        0.10
2          2        0.03
3          9        0.19
4          7        0.095
5          3        0.07
6          3        0.02
7          4        0.07
8          5        0.085
9          8        0.12
10         3        0.04
11         5        0.06
12         5        0.05
13         6        0.10
14         7        0.09
15         1        0.01
16         4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs); vertical axis: FUEL CONSUMP. (gal/100 miles)]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

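Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs); vertical axis: FUEL CONSUMP. (gal/100 miles)]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$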
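The formula for r is an average of products of z-scores, which can be computed directly. This sketch assumes the car weight and fuel consumption pairs read as decimals (for example 3.4 thousand lbs and 5.5 gal/100 miles); the function name `correlation` is ours.

```python
import statistics

# (weight in 1000 lbs, fuel consumption in gal/100 miles) for 10 cars.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


r = correlation(weights, fuel)
print(round(r, 4))
```

Because r is built from z-scores, it has no units and does not change if weight is measured in kilograms or fuel consumption in liters.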

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points vs fouls. Horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Histograms: Shape

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed. Cost: right-skewed

Shape (cont.): outliers. All 200 m Races, 20.2 secs or less

[Figure: histogram, 200 m Races 20.2 secs or less (approx. 700). Horizontal axis: TIMES (19.2 to 20.16); vertical axis: Frequency. Labeled outliers: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32]

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: histogram with Alaska and Florida labeled as outliers]

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Excel Example: 2012-13 NFL Salaries

[Figure: Excel histogram of 2012-13 NFL salaries. Horizontal axis: Bin; vertical axis: Frequency (0 to 1000)]

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

Example: Grades on a statistics exam

Data:

75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55

Example-2: Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total            30

Example-3: Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
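The frequency and relative frequency tables can be generated from the raw grades with a short script. In this sketch each class includes its lower limit and excludes its upper limit ("40 up to 50" means 40 ≤ grade < 50); the variable names are ours.

```python
# The 30 exam grades from the example.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

# Class limits: 40 up to 50, 50 up to 60, ..., 90 up to 100.
edges = [40, 50, 60, 70, 80, 90, 100]
classes = list(zip(edges, edges[1:]))

# Frequency: number of grades with lower <= grade < upper.
counts = [sum(1 for g in grades if lo <= g < hi) for lo, hi in classes]

# Relative frequency: class frequency divided by the number of grades.
relative = [c / len(grades) for c in counts]

for (lo, hi), c, rf in zip(classes, counts, relative):
    print(f"{lo} up to {hi}: {c:2d}  {rf:.3f}")
```

The frequencies should add to n = 30 and the relative frequencies to 1, which is a useful check before drawing the histogram.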

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram of grades. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%

2. 5%

3. 17%

4. 30%

Stem and leaf displays

Have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
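The construction of the display (split each value into a 10's-digit stem and a 1's-digit leaf, then list the leaves in ascending order) can be sketched as follows. The function name `stem_and_leaf` is ours, and the sketch handles only non-negative whole numbers.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf display (stem = 10's digit, leaf = 1's digit)."""
    rows = defaultdict(list)
    for value in sorted(values):            # sorting keeps leaves in ascending order
        rows[value // 10].append(value % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages)))
```

Keeping the empty stems matters: if a 95-year-old is added, stems 7 and 8 appear with no leaves, so the gap in the data stays visible.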

Suppose a 95 yr old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates n = 138

     Stem | Leaves
        4 |
 3      4 | 588
 9      5 | 001233444
10      5 | 5556788899
23      6 | 00011111122233333344444
23      6 | 55556666667777788888888
16      7 | 00000112222334444
23      7 | 55555666666777888888999
10      8 | 0000112224
10      8 | 5555667789
 4      9 | 0012
 2      9 | 58
 4     10 | 0223
       10 |
 1     11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays TD passes by NFL teams 1999-2000 2012-13multiply stems by 10

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10rsquos digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: frequency histogram of Absences from Work; bin labels 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)

Median = mean of the 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6. Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
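The two-case rule for the median translates directly into code. The sketch below is a minimal illustration (the function name `median` is chosen here for convenience; Python's standard `statistics.median` does the same job):

```python
def median(values):
    """Middle value when n is odd; mean of the 2 middle values when n is even."""
    ordered = sorted(values)          # the rule assumes data in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```

Sorting inside the function means the caller does not have to order the data first, as in the unordered examples later in this section.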

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$n = 23$, so the median is at location 12: $m$ = 12th observation = 85
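A quick check of the class pulse rates with Python's standard `statistics` module shows how differently the two measures react to the single large value 140. Dropping that value is done here purely for illustration; it is not something the slides do.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

print(round(mean(pulses), 2), median(pulses))        # 84.48 85

# Illustration only: remove the largest observation and recompute.
without_140 = pulses[:-1]
print(round(mean(without_140), 2), median(without_140))
```

The mean moves by about 2.5 beats when one observation is removed, while the median moves by only half a beat.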

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985-2014

[Figure: time plot of baseball salaries by Year, 1985 to 2014, with three series (Mean, Median, Maximum); left axis: Mean, Median Salary ($200,000 to $3,700,000); right axis: Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600); bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed): mean < median

mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30)]

Symmetric data: mean and median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20)]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude, and sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

i | $x_i$ | $\bar{x}$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$
1 | 59 | 63.4 | -4.4 | 19.0
2 | 60 | 63.4 | -3.4 | 11.3
3 | 61 | 63.4 | -2.4 | 5.6
4 | 62 | 63.4 | -1.4 | 1.8
5 | 62 | 63.4 | -1.4 | 1.8
6 | 63 | 63.4 | -0.4 | 0.1
7 | 63 | 63.4 | -0.4 | 0.1
8 | 63 | 63.4 | -0.4 | 0.1
9 | 64 | 63.4 | 0.6 | 0.4
10 | 64 | 63.4 | 0.6 | 0.4
11 | 65 | 63.4 | 1.6 | 2.7
12 | 66 | 63.4 | 2.6 | 7.0
13 | 67 | 63.4 | 3.6 | 13.3
14 | 68 | 63.4 | 4.6 | 21.6
 | | Mean 63.4 | Sum 0.0 | Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: the height data with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
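As one software option, the height calculation can be reproduced in plain Python. This is a sketch rather than a prescribed workflow; it assumes the 14 heights in the table are 59 through 68 inches as listed and follows the same steps as the hand calculation (deviations, squared deviations, divide by n - 1, square root).

```python
from math import sqrt

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
x_bar = sum(heights) / n
deviations = [x - x_bar for x in heights]          # these always sum to 0
sum_sq = sum(d ** 2 for d in deviations)           # sum of squared deviations
variance = sum_sq / (n - 1)                        # divide by degrees of freedom
s = sqrt(variance)

print(round(x_bar, 1), round(sum_sq, 1), round(variance, 2), round(s, 2))
# 63.4 85.2 6.55 2.56
```

The standard library gives the same values directly: `statistics.variance(heights)` and `statistics.stdev(heights)` both use the n - 1 divisor.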

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

the value of $\sigma$ is typically not known

use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance, $\sigma^2$ the population variance. Units are squared units of the original data (square $, square gallons).

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0 (when does $s = 0$? $\sigma = 0$?)

The larger the value of $s$ (or $\sigma$), the greater the spread of the data

the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE

$\bar{y}$ sample mean
$m$ sample median
$s^2$ sample variance
$s$ sample stand. dev.

POPULATION

$\mu$ population mean
population median
$\sigma^2$ population variance
$\sigma$ population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

The 68-95-99.7 rule links the Mean and Standard Deviation (numerical) to the Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

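68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]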

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342 346 347 348 348 349 354 355 355 360 361 364 367 369 371 373 377 380 381 382 385 385 387 390 390 397 398 409 409 410 418 422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont), same 50 values

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont), same 50 values

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont), same 50 values

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
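The three interval counts for the textbook costs can be checked with a short script. This sketch takes the mean and standard deviation as 375.48 and 42.72 (assuming the extraction dropped the decimal points from 37548 and 4272) rather than recomputing them from the rounded data; the helper name `share_within` is hypothetical.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72   # summary values as read from the slide (assumed decimals)

def share_within(k):
    """Fraction of the costs inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(low < c < high for c in costs) / len(costs)

for k in (1, 2, 3):
    print(k, share_within(k))   # 0.64, 0.96, 1.0
```

The observed shares (64%, 96%, 100%) sit close to the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.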

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better
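The SAT/ACT comparison is a one-line formula, so a small helper makes it reusable. The function name `z_score` is chosen here for illustration only.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math

print(eleanor, gerald)                     # 1.8 1.5
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because both scores are expressed in standard-deviation units, they can be compared even though the two tests use different scales.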

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School | Support | y - ybar | Z-score
Maryland | 15.5 | 6.4 | 1.79
UVA | 13.1 | 4.0 | 1.12
Louisville | 10.9 | 1.8 | 0.50
UNC | 9.2 | 0.1 | 0.03
VaTech | 7.9 | -1.2 | -0.34
FSU | 7.9 | -1.2 | -0.34
GaTech | 7.1 | -2.0 | -0.56
NCSU | 6.5 | -2.6 | -0.73
Clemson | 3.8 | -5.3 | -1.47
 | Mean = 9.1000, s = 3.5697 | Sum = 0 | Sum = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6,185, with a standard deviation of $1,804. In NC the mean tuition was $4,320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

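Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Sorted data, listed as (observation number, count within quarter, value):

(1, 1, 0.6) (2, 2, 1.2) (3, 3, 1.6) (4, 4, 1.9) (5, 5, 1.5) (6, 6, 2.1) (7, 7, 2.3) (8, 6, 2.3) (9, 5, 2.5) (10, 4, 2.8) (11, 3, 2.9) (12, 2, 3.3) (13, 1, 3.4) (14, 2, 3.6) (15, 3, 3.7) (16, 4, 3.8) (17, 5, 3.9) (18, 6, 4.1) (19, 7, 4.2) (20, 6, 4.5) (21, 5, 4.7) (22, 4, 4.9) (23, 3, 5.3) (24, 2, 5.6) (25, 1, 6.1)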

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
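The quartile rules above can be sketched as a small function. Note that this follows the convention stated on the slide (include the overall median in both halves when n is odd); statistical software often uses other quartile conventions, so results from a calculator or spreadsheet may differ slightly. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the lower-half / upper-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

The interquartile range then follows as `q3 - q1` from the returned values.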

Pulse Rates n = 138

count | stem | leaves
 | 4 |
3 | 4 | 588
9 | 5 | 001233444
10 | 5 | 5556788899
23 | 6 | 00011111122233333344444
23 | 6 | 55556666667777788888888
16 | 7 | 00000112222334444
23 | 7 | 55555666666777888888999
10 | 8 | 0000112224
10 | 8 | 5555667789
4 | 9 | 0012
2 | 9 | 58
4 | 10 | 0223
 | 10 |
1 | 11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Five-number summary (Disease X, years until death): min, Q1, m, Q3, max

Smallest = min = 0.6; Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2; Largest = max = 6.1

[Figure: the 25 sorted Disease X values, listed from largest (6.1) down to smallest (0.6), beside a boxplot; vertical axis: Years until death (0 to 7)]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

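Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Sorted data, largest first, listed as (observation number, count within quarter, value):

(25, 1, 7.9) (24, 2, 6.1) (23, 3, 5.3) (22, 4, 4.9) (21, 5, 4.7) (20, 6, 4.5) (19, 7, 4.2) (18, 6, 4.1) (17, 5, 3.9) (16, 4, 3.8) (15, 3, 3.7) (14, 2, 3.6) (13, 1, 3.4) (12, 2, 3.3) (11, 3, 2.9) (10, 4, 2.8) (9, 5, 2.5) (8, 6, 2.3) (7, 7, 2.3) (6, 6, 2.1) (5, 5, 1.5) (4, 4, 1.9) (3, 3, 1.6) (2, 2, 1.2) (1, 1, 0.6)

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.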

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with marked values 40.5, 45, 63, 70, 78, 100.5]
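The 1.5 × IQR calculation can be wrapped in a helper that returns both cutoffs. This is a sketch of the rule as used on these slides; the names `fences` and `extremes` are hypothetical, and the values 45 and 111 are the minimum and maximum from the pulse 5-number summary.

```python
def fences(q1, q3):
    """Return (lower, upper) outlier cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = fences(63, 78)
print(low, high)                      # 40.5 100.5

# Minimum and maximum pulse from the 5-number summary (45 and 111).
extremes = [45, 111]
outliers = [p for p in extremes if p < low or p > high]
print(outliers)                       # [111]
```

The smallest pulse (45) lies inside the cutoffs, so the lower whisker of the boxplot reaches the minimum, while 111 is flagged as an outlier on the high side.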

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

 | Crew | First | Second | Third | Total
Alive | 212 | 202 | 118 | 178 | 710
Dead | 673 | 123 | 167 | 528 | 1491
Total | 885 | 325 | 285 | 706 | 2201

Marginal distributions

marg. dist. of survival: Alive 710/2201 = 32.3%; Dead 1491/2201 = 67.7%

marg. dist. of class: Crew 885/2201 = 40.2%; First 325/2201 = 14.8%; Second 285/2201 = 12.9%; Third 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival | | Crew | First | Second | Third | Total
Alive | Count | 212 | 202 | 118 | 178 | 710
Alive | % of col. | 24.0 | 62.2 | 41.4 | 25.2 | 32.3
Dead | Count | 673 | 123 | 167 | 528 | 1491
Dead | % of col. | 76.0 | 37.8 | 58.6 | 74.8 | 67.7
Total | Count | 885 | 325 | 285 | 706 | 2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic counts and column percentages):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
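The marginal and conditional distributions for the Titanic table can be computed from the raw counts alone. The sketch below stores the counts in a nested dictionary (a layout chosen here for illustration) and reproduces the percentages quoted on the slides.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = list(counts["Alive"])

grand_total = sum(sum(row.values()) for row in counts.values())          # 2201
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}   # column totals

# Marginal distributions: row or column totals divided by the grand total.
survival_marginal = {s: sum(row.values()) / grand_total for s, row in counts.items()}
class_marginal = {c: class_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
alive_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

print(round(100 * survival_marginal["Alive"], 1))   # 32.3
print(round(100 * class_marginal["First"], 1))      # 14.8
print(round(100 * alive_given_class["First"], 1))   # 62.2
```

The three question-slide fractions differ only in the denominator: 202/710 conditions on survivors, 202/2201 uses the grand total, and 202/325 conditions on first class.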

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0
2. 23.5
3. 58.2
4. 27.7

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8
2. 38.8
3. 51.2
4. 19.8

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2
2. 48.8
3. 26.8
4. 27.7

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student | Beers | Blood Alcohol
1 | 5 | 0.10
2 | 2 | 0.03
3 | 9 | 0.19
4 | 7 | 0.095
5 | 3 | 0.07
6 | 3 | 0.02
7 | 4 | 0.07
8 | 5 | 0.085
9 | 8 | 0.12
10 | 3 | 0.04
11 | 5 | 0.06
12 | 5 | 0.05
13 | 6 | 0.10
14 | 7 | 0.09
15 | 1 | 0.01
16 | 4 | 0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students (Student, Beers, BAC)]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

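Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$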
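The correlation formula is a sum of products of z-scores, which the sketch below computes directly for the car data. It assumes the extracted pairs lost their decimal points, so that "(34 55)" is read as weight 3.4 (1000 lbs) and fuel consumption 5.5 (gal/100 miles); under that reading the result matches the value r = .9766 stated on the slide. The function name `correlation` is chosen here for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of (z-score of x_i) * (z-score of y_i)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations (n - 1 divisor)
    products = ((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # assumed decimals
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # assumed decimals

r = correlation(weight, fuel)
print(round(r, 4))   # 0.9766
```

Because each factor is a z-score, r has no units and does not change if weight is recorded in kilograms or fuel consumption in litres.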

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

Page 23: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Shape (cont)Female heart attack patients in New York state

Age left-skewed Cost right-skewed

Shape (cont) outliersAll 200 m Races 202 secs or less

192 1926193219381944 195 1956196219681974 198 1986199219982004 201 20160

10

20

30

40

50

60

200 m Races 202 secs or less (approx 700)

TIMES

Fre

qu

ency Usain Bolt

2008 1930Michael Johnson1996 1932

Alaska Florida

Shape (cont) Outliers

An important kind of deviation is an outlier Outliers are observations

that lie outside the overall pattern of a distribution Always look for

outliers and try to explain them

The overall pattern is fairly

symmetrical except for 2

states clearly not belonging

to the main trend Alaska

and Florida have unusual

representation of the

elderly in their population

A large gap in the

distribution is typically a

sign of an outlier

Excel Example 2012-13 NFL Salaries

3694

80

1273

609

231

2177

738

462

3081

867

692

3985

996

923

4890

126

154

5794

255

385

6698

384

615

7602

513

846

8506

643

077

9410

772

308

1031

4901

54

1121

9030

77

1212

3160

1302

7289

23

1393

1418

46

1483

5547

69

1573

9676

92

1664

3806

15

1754

7935

38

0

100

200

300

400

500

600

700

800

900

1000

Histogram

Bin

Fre

qu

ency

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hiredstem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams 2012-2013 season(stems are 10rsquos digit)

stem leaf

43

03247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Stem Leaves 4 3 4 588 9 5 001233444 10 5 5556788899 23 6 00011111122233333344444 23 6 55556666667777788888888 16 7 00000112222334444 23 7 55555666666777888888999 10 8 0000112224 10 8 5555667789 4 9 0012 2 9 58 4 10 0223 10 1 11 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
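As a quick check of the arithmetic, here is a minimal sketch that assumes the seven viewing times are 19, 40, 16, 12, 10, 6 and 9 hours, the values that appear in the slide's sum.

```python
hours = [19, 40, 16, 12, 10, 6, 9]  # weekly TV viewing times (hours)

n = len(hours)        # sample size
total = sum(hours)    # sum of the observations y_1 + ... + y_n
y_bar = total / n     # sample mean

print(n, total, y_bar)
```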

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram; horizontal axis: Absences from Work (bins 118.5 to 160.5, More), vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
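The definition translates directly into code. The function below is a minimal sketch (its name is chosen here for illustration), applied to the two examples above.

```python
def median(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))  # n = 5, odd
print(median([2, 4, 6, 8]))      # n = 4, even
```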

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$

2010, 2014 baseball salaries

2010: n = 845, mean = $3,297,828, median = $1,330,000, max = $33,000,000

2014: n = 848, mean = $3,932,912, median = $1,456,250, max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year; left axis: Mean, Median Salary (200,000 to 3,700,000); right axis: Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1000s), vertical axis: Frequency (0 to 600)]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100), vertical axis: Frequency (0 to 30)]

Symmetric data

mean, median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis: Number of Customers, vertical axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

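Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$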
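A short sketch makes the point numerically: the two data sets, 49 and 51 versus 0 and 100, have very different spreads, yet the deviations from the mean sum to zero in both, so that sum cannot serve as a measure of variability. The helper name is illustrative.

```python
def sum_of_deviations(values):
    """Sum of (value - mean) over the data set."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)

set1 = [49, 51]
set2 = [0, 100]

for data in (set1, set2):
    data_range = max(data) - min(data)
    print(data, "range:", data_range, "sum of deviations:", sum_of_deviations(data))
```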

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

  i    xi    x̄    (xi - x̄)   (xi - x̄)²
  1    59   63.4    -4.4       19.0
  2    60   63.4    -3.4       11.3
  3    61   63.4    -2.4        5.6
  4    62   63.4    -1.4        1.8
  5    62   63.4    -1.4        1.8
  6    63   63.4    -0.4        0.1
  7    63   63.4    -0.4        0.1
  8    63   63.4    -0.4        0.1
  9    64   63.4     0.6        0.4
 10    64   63.4     0.6        0.4
 11    65   63.4     1.6        2.7
 12    66   63.4     2.6        7.0
 13    67   63.4     3.6       13.3
 14    68   63.4     4.6       21.6

Mean 63.4       Sum 0.0    Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
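One software route, offered as a sketch rather than the course's prescribed tool, is Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 as in the sample formulas. The data are the 14 women's heights from the table.

```python
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

mean = statistics.mean(heights)
variance = statistics.variance(heights)  # sample variance, divides by n - 1
sd = statistics.stdev(heights)           # sample standard deviation

print(round(mean, 1), round(variance, 2), round(sd, 2))
```

For a full population rather than a sample, the same module provides `pvariance` and `pstdev`, which divide by N.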

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$     sample mean
$m$           sample median
$s^2$         sample variance
$s$           sample stand dev

POPULATION
$\mu$         population mean
              population median
$\sigma^2$    population variance
$\sigma$      population stand dev

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

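68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]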

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%

68-95-99.7 rule: 68%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%

68-95-99.7 rule: 95%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%

68-95-99.7 rule: 99.7%
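The three counts can be checked with a short script. This sketch takes the mean 375.48 and standard deviation 42.72 as stated on the slides rather than recomputing them; the function name is illustrative.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72  # values stated on the slides

def share_within(k):
    """Fraction of costs strictly inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low < c < high for c in costs)
    return inside / len(costs)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.0%}")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 benchmarks, which is what "approximately" in the rule allows for.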

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

Preceding slides: 68-95-99.7 rule (also called the Empirical Rule)

Next: z-scores

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
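Standardizing is a one-line computation. The sketch below (function name chosen for illustration) reproduces the exam and SAT/ACT comparisons.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)
eleanor = z_score(680, 500, 100)  # SAT Math
gerald = z_score(27, 18, 6)       # ACT Math

print(exam1, exam2)
print(eleanor, gerald)
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly: the larger z-score is the better relative performance.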

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1    M    Q3
 1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
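The rules can be sketched in code as follows. Note that this follows the convention stated above (include the overall median in both halves when n is odd); other textbooks and software packages use different quartile conventions and may return slightly different values. Function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the lower-half / upper-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the halves split cleanly
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(data)
print(q1, m, q3, "IQR =", q3 - q1)
```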

Pulse Rates n = 138

      Stem | Leaves
        4  |
  3     4  | 588
  9     5  | 001233444
 10     5  | 5556788899
 23     6  | 00011111122233333344444
 23     6  | 55556666667777788888888
 16     7  | 00000112222334444
 23     7  | 55555666666777888888999
 10     8  | 0000112224
 10     8  | 5555667789
  4     9  | 0012
  2     9  | 58
  4    10  | 0223
       10  |
  1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 6.1

Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X; vertical axis: Years until death (0 to 7)]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
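The 1.5 × IQR check can be sketched as follows. The quartiles are taken as stated on the slide (Q1 = 2.3, Q3 = 4.2), and the data values assume the stripped decimals read as 0.6, 1.2, ..., 7.9 years; the lower fence is included for completeness even though the slide only uses the upper one.

```python
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2                 # quartiles stated on the slide
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker ends at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```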

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses with marked values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive   710/2201    32.3%
Dead    1491/2201   67.7%

marg. dist. of class:
Crew     885/2201   40.2%
First    325/2201   14.8%
Second   285/2201   12.9%
Third    706/2201   32.1%
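The marginal distributions come from the row and column totals divided by the grand total. A minimal sketch, with the table stored as a nested dictionary (a layout chosen here only for illustration):

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total.
class_marginal = {cls: sum(counts[status][cls] for status in counts) / total
                  for cls in counts["Alive"]}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```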

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                             Class
Survival            Crew   First   Second   Third   Total
Alive  Count         212     202      118     178     710
       % of col     24.0    62.2     41.4    25.2    32.3
Dead   Count         673     123      167     528    1491
       % of col     76.0    37.8     58.6    74.8    67.7
Total  Count         885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                             Class
Survival            Crew   First   Second   Third   Total
Alive  Count         212     202      118     178     710
       % of col     24.0    62.2     41.4    25.2    32.3
Dead   Count         673     123      167     528    1491
       % of col     76.0    37.8     58.6    74.8    67.7
Total  Count         885     325      285     706    2201
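The three questions differ only in the denominator. The sketch below computes each fraction from the Titanic counts and also rebuilds the "% of col" row for survivors; the variable names are illustrative.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
class_total = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

survivors = sum(alive.values())         # 710
passengers = sum(class_total.values())  # 2201

# Conditional on being a survivor: denominator is the number of survivors.
first_among_survivors = alive["First"] / survivors

# Joint: first class AND survived, out of everyone on board.
first_and_survived = alive["First"] / passengers

# Conditional on class: denominator is the class total ("% of col").
survival_given_class = {cls: alive[cls] / class_total[cls] for cls in alive}

print(f"{first_among_survivors:.3f} {first_and_survived:.3f}")
for cls, share in survival_given_class.items():
    print(f"{cls}: {share:.1%}")
```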

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80

2. 235

3. 582

4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418

2. 388

3. 512

4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452

2. 488

3. 268

4. 277

Section 3.5 Bivariate Descriptive Statistics

Previous slides: Contingency Tables for Bivariate Categorical Data

Next: Scatterplots and Correlation for Bivariate Quantitative Data

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

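Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$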
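The formula can be applied directly to the car data. This sketch assumes the ten pairs read as (3.4, 5.5), (3.8, 5.9), ..., (3.4, 4.9), with weight in 1000 lbs and fuel consumption in gal/100 miles; the function name is illustrative.

```python
import math

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # y, gal/100 miles

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight is recorded in kilograms or fuel consumption in liters.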

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont)

High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 24: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Shape (cont) outliersAll 200 m Races 202 secs or less

192 1926193219381944 195 1956196219681974 198 1986199219982004 201 20160

10

20

30

40

50

60

200 m Races 202 secs or less (approx 700)

TIMES

Fre

qu

ency Usain Bolt

2008 1930Michael Johnson1996 1932

Alaska Florida

Shape (cont) Outliers

An important kind of deviation is an outlier Outliers are observations

that lie outside the overall pattern of a distribution Always look for

outliers and try to explain them

The overall pattern is fairly

symmetrical except for 2

states clearly not belonging

to the main trend Alaska

and Florida have unusual

representation of the

elderly in their population

A large gap in the

distribution is typically a

sign of an outlier

Excel Example 2012-13 NFL Salaries

3694

80

1273

609

231

2177

738

462

3081

867

692

3985

996

923

4890

126

154

5794

255

385

6698

384

615

7602

513

846

8506

643

077

9410

772

308

1031

4901

54

1121

9030

77

1212

3160

1302

7289

23

1393

1418

46

1483

5547

69

1573

9676

92

1664

3806

15

1754

7935

38

0

100

200

300

400

500

600

700

800

900

1000

Histogram

Bin

Fre

qu

ency

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
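The construction of a stem-and-leaf display (stem = 10's digit, leaf = 1's digit) can be sketched in Python as follows. The helper name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole-number data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (10's digit) to its sorted list of leaves (1's digits).

    Stems with no observations are kept as empty rows, so gaps in the
    data remain visible in the display.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}

# Employee ages from the example slide
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem:2d} | {' '.join(str(leaf) for leaf in leaves)}")
```

Keeping the empty stems matters when an unusual value is added: a 95-year-old employee would appear on stem 9, with stems 7 and 8 shown empty.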

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement is displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by \(n\).

For a variable denoted by \(y\), its observations are denoted by \(y_1, y_2, y_3, \ldots, y_n\).

A common measure of center is the sample mean. The sample mean is denoted by \(\bar{y}\):

\[ \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} \]

Shortened expression for \(\bar{y}\) using the symbol \(\Sigma\) (uppercase Greek letter sigma):

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i \]

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

\[ \sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112 \]

\[ \bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16 \]
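The sample mean calculation for the TV viewing example can be checked with a short Python sketch. The function name `sample_mean` is chosen here for illustration only.

```python
def sample_mean(values):
    """Sample mean: the sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

# Weekly TV viewing time (hours) of the 7 randomly selected 4th graders
tv_hours = [19, 40, 16, 12, 10, 6, 9]

print(sum(tv_hours))          # total of the observations
print(sample_mean(tv_hours))  # total divided by n = 7
```

Python's standard library offers the same calculation as `statistics.mean`.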

Population Mean

Denoted by the Greek letter \(\mu\):

\[ \mu = \frac{\sum_{i=1}^{N} y_i}{N} \]

\(N\) is the population size (for example, \(N = 34{,}000\) for NCSU)

the value of \(\mu\) is typically not known

we often use the sample mean \(\bar{y}\) to estimate the unknown value of \(\mu\)

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean \(\bar{x} = 140.6\)

[Figure: histogram of Absences from Work; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex. 2, 4, 6, 8, 10: n = 5, median = 6
Ex. 2, 4, 6, 8: n = 4, median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
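The rule for the median (middle value when n is odd, mean of the 2 middle values when n is even) can be sketched in Python and applied to the student pulse rates. The function name `median` is our own; the standard library's `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data (n odd) or mean of the 2 middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Student pulse rates from the slide (n = 62)
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(median([2, 4, 6, 8, 10]))  # n odd
print(median([2, 4, 6, 8]))      # n even
print(median(pulses))            # mean of the 31st and 32nd ordered values
```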

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458
Example, n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458
Example, n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: \(\bar{x} = \frac{20}{4} = 5\), \(m = \frac{4+6}{2} = 5\)

Ex. 2, 4, 6, 9: \(\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}\), \(m = \frac{4+6}{2} = 5\)

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

\(n = 23\)

\[ \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48 \]

\(m\): location is the 12th observation; \(m = 85\)

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis: Year; one vertical axis: Mean/Median Salary (200,000 to 3,700,000); second vertical axis: Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1000s), vertical axis: Frequency (0 to 600); bar labels: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100), vertical axis: Frequency (0 to 30)]

Symmetric data

mean and median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis: Number of Customers, vertical axis: Frequency (0 to 20)]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean \(\bar{y}\)

\(y_i - \bar{y}\): deviation of \(y_i\) from the mean

\(\sum_{i=1}^{n} (y_i - \bar{y})\): sum the deviations of all the \(y_i\)'s from \(\bar{y}\)

\[ \sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing} \]

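Example: sum of deviations from the mean

Data set 1: \(x_1 = 49,\ x_2 = 51,\ \bar{x} = 50\)

\[ (x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0 \]

Data set 2: \(y_1 = 0,\ y_2 = 100,\ \bar{y} = 50\)

\[ (y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0 \]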

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation} \]

\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance} \]

Calculations ...

Women's height (inches)

 i    x_i    \bar{x}    (x_i - \bar{x})    (x_i - \bar{x})^2
 1    59     63.4       -4.4               19.0
 2    60     63.4       -3.4               11.3
 3    61     63.4       -2.4                5.6
 4    62     63.4       -1.4                1.8
 5    62     63.4       -1.4                1.8
 6    63     63.4       -0.4                0.1
 7    63     63.4       -0.4                0.1
 8    63     63.4       -0.4                0.1
 9    64     63.4        0.6                0.4
10    64     63.4        0.6                0.4
11    65     63.4        1.6                2.7
12    66     63.4        2.6                7.0
13    67     63.4        3.6               13.3
14    68     63.4        4.6               21.6
      Mean 63.4          Sum 0.0           Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = \(\sqrt{6.55}\) = 2.56 inches

1. First calculate the variance \(s^2\):

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

2. Then take the square root to get the standard deviation \(s\):

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
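As one way to automate the calculation, here is a minimal Python sketch of the two steps (variance first, then its square root) applied to the 14 women's heights. The function names are illustrative only.

```python
import math

# Heights (inches) of the 14 women in the worked example
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (degrees of freedom)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

print(round(sample_variance(heights), 2))  # variance, in square inches
print(round(sample_sd(heights), 2))        # standard deviation, in inches
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they should agree with these results.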

Population Standard Deviation

Denoted by the lower case Greek letter \(\sigma\):

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation} \]

\(N\) is the population size (for example, \(N = 34{,}000\) for NCSU)

\(\mu\) is the population mean

value of \(\sigma\) typically not known

use \(s\) to estimate the value of \(\sigma\)

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that \(s\) and \(\sigma\) are always greater than or equal to zero.

3. The larger the value of \(s\) (or \(\sigma\)), the greater the spread of the data.

When does \(s = 0\)? When does \(\sigma = 0\)?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: \(s^2\) is the sample variance; \(\sigma^2\) is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of \(s\) and \(\sigma\)

\(s\) and \(\sigma\) are always greater than or equal to 0

when does \(s = 0\)? \(\sigma = 0\)?

The larger the value of \(s\) (or \(\sigma\)), the greater the spread of the data

the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE
\(\bar{y}\): sample mean
\(m\): sample median
\(s^2\): sample variance
\(s\): sample stand. dev.

POPULATION
\(\mu\): population mean
population median
\(\sigma^2\): population variance
\(\sigma\): population stand. dev.

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in \((\bar{y} - s,\ \bar{y} + s)\)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in \((\bar{y} - 2s,\ \bar{y} + 2s)\)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in \((\bar{y} - 3s,\ \bar{y} + 3s)\)

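68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between \(\bar{y} - s\) and \(\bar{y} + s\) (34% on each side of \(\bar{y}\))]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between \(\bar{y} - 2s\) and \(\bar{y} + 2s\) (47.5% on each side of \(\bar{y}\))]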

Example: textbook costs

\(\bar{y} = 375.48\), \(s = 42.72\), \(n = 50\)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
\((\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)\)
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
\((\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)\)
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
\((\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)\)
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
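The comparison of the textbook cost data against the 68-95-99.7 rule can be sketched in Python. The mean 375.48 and standard deviation 42.72 are taken from the slide as stated rather than recomputed, and `share_within` is a hypothetical helper name.

```python
# Textbook costs for the 50 students in the example
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72  # mean and standard deviation as stated on the slide

def share_within(values, center, spread, k):
    """Count and fraction of values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low < v < high)
    return inside, inside / len(values)

for k in (1, 2, 3):
    count, fraction = share_within(costs, y_bar, s, k)
    print(f"within {k} s.d.: {count}/{len(costs)} = {fraction:.0%}")
```

The observed percentages are close to, but not exactly, 68%, 95%, and 99.7%, which is what the rule promises only for approximately bell-shaped data.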

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

\[ z = \frac{y - \bar{y}}{s} \]

where
\(y\) = original data value
\(\bar{y}\) = the sample mean
\(s\) = the sample standard deviation
\(z\) = the z-score corresponding to \(y\)

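Exam 1: \(\bar{y}_1 = 88\), \(s_1 = 6\); your exam 1 score: 91

Exam 2: \(\bar{y}_2 = 88\), \(s_2 = 10\); your exam 2 score: 92

Which score is better?

\[ z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \]

\[ z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4 \]

91 on exam 1 is better than 92 on exam 2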

If data has mean \(\bar{y}\) and standard deviation \(s\), then standardizing a particular value of \(y\) indicates how many standard deviations \(y\) is above or below the mean \(\bar{y}\).

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better
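The two comparisons above (exam 1 vs. exam 2, SAT vs. ACT) use the same standardization, which can be sketched in Python. The function name `z_score` is chosen here for illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam example: 91 on exam 1 (mean 88, s = 6) vs. 92 on exam 2 (mean 88, s = 10)
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT/ACT example: Eleanor's SAT Math 680 vs. Gerald's ACT Math 27
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)
print(z_eleanor, z_gerald)
```

Because z-scores are unit-free, scores from tests on different scales can be compared directly: the larger z-score is the better relative performance.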

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar column); Sum = 0 (Z-score column)
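That z-scores add to zero can be checked numerically. The sketch below reads the support figures as $ millions with one decimal place (15.5, 13.1, and so on), an assumption that reproduces the stated mean of 9.1 and s of 3.5697; variable names are illustrative.

```python
import math

# Support to athletic departments ($ millions), read with one decimal place (assumed)
support = {"Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
           "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8}

values = list(support.values())
n = len(values)
mean = sum(values) / n
s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

# z-score for each school: deviation from the mean in units of s
z = {school: (v - mean) / s for school, v in support.items()}

print(round(mean, 4), round(s, 4))
print(round(sum(z.values()), 10))  # zero, up to floating-point rounding
```

The sum is zero because the deviations from the mean always add to zero, and dividing each deviation by the same s does not change that.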

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10
Q1 = 6

Q3: median of upper half 12 14 16 18 20
Q3 = 16
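The quartile rules above can be sketched in Python. Note that this follows the convention stated on the slide (include the overall median in both halves when n is odd); other software may use a different quartile convention and give slightly different answers. Function names are illustrative.

```python
def median(values):
    """Middle value (n odd) or mean of the 2 middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # the n = 10 example
print(quartiles([2, 4, 6, 8, 10]))                       # an n odd case
```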

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of the pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates marking 40.5, 45, 63, 70, 78, and 100.5]
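The 1.5 × IQR rule used in the two examples above can be sketched in Python. The function names are illustrative, and the Disease X quartiles are read as 2.3 and 4.2 years (decimals assumed from the 0 to 7 year axis).

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values falling outside the fences are flagged as suspected outliers."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Beginning-of-class pulses: Q1 = 63, Q3 = 78
print(outlier_fences(63, 78))

# Disease X: Q1 = 2.3, Q3 = 4.2; is 6.1 or 7.9 years an outlier?
print(outlier_fences(2.3, 4.2))
print(find_outliers([6.1, 7.9], 2.3, 4.2))
```

In a boxplot, the whiskers extend only to the most extreme data values inside the fences; anything beyond is plotted individually.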

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212    202     118      178     710
Dead      673    123     167      528    1491
Total     885    325     285      706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
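The marginal distributions are row totals and column totals divided by the grand total. A minimal Python sketch, using the Titanic counts from the table (the variable names are our own):

```python
# Titanic counts: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {status: sum(row.values()) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total
class_marginal = {c: sum(counts[status][c] for status in counts) / grand_total
                  for c in classes}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```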

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                                 Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528    1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                                 Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528    1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706    2201
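The three questions differ only in the denominator: survivors, all people aboard, or first class. A short Python sketch of the conditional distributions and the three fractions, with illustrative variable names:

```python
# Titanic counts: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

class_totals = {c: counts["Alive"][c] + counts["Dead"][c] for c in classes}
alive_total = sum(counts["Alive"].values())
grand_total = sum(class_totals.values())

# Conditional distribution of survival given class ("% of col" rows)
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

first_given_alive = counts["Alive"]["First"] / alive_total            # 202/710
first_and_alive = counts["Alive"]["First"] / grand_total              # 202/2201
alive_given_first = counts["Alive"]["First"] / class_totals["First"]  # 202/325

for c in classes:
    print(f"{c}: {survival_given_class[c]:.1%} survived")
print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
```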

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)

Student   Beers   BAC
   1        5     0.1
   2        2     0.03
   3        9     0.19
   4        7     0.095
   5        3     0.07
   6        3     0.02
   7        4     0.07
   8        5     0.085
   9        8     0.12
  10        3     0.04
  11        5     0.06
  12        5     0.05
  13        6     0.1
  14        7     0.09
  15        1     0.01
  16        4     0.05

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

\((x_i, y_i)\): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) \]

bivariate data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)

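Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) \]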
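The formula for r (the average product of the standardized x and y values, with divisor n - 1) can be sketched in Python. The car data are read here as weights in 1000 lbs and fuel consumption in gal/100 miles with one decimal place (3.4, 5.5, and so on), an assumption based on the axis ranges; the function name `correlation` is illustrative.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles), decimals assumed
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

print(round(correlation(weight, fuel), 4))
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in kilograms or fuel consumption in litres.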

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player
y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 25: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Alaska Florida

Shape (cont) Outliers

An important kind of deviation is an outlier Outliers are observations

that lie outside the overall pattern of a distribution Always look for

outliers and try to explain them

The overall pattern is fairly

symmetrical except for 2

states clearly not belonging

to the main trend Alaska

and Florida have unusual

representation of the

elderly in their population

A large gap in the

distribution is typically a

sign of an outlier

Excel Example 2012-13 NFL Salaries

3694

80

1273

609

231

2177

738

462

3081

867

692

3985

996

923

4890

126

154

5794

255

385

6698

384

615

7602

513

846

8506

643

077

9410

772

308

1031

4901

54

1121

9030

77

1212

3160

1302

7289

23

1393

1418

46

1483

5547

69

1573

9676

92

1664

3806

15

1754

7935

38

0

100

200

300

400

500

600

700

800

900

1000

Histogram

Bin

Fre

qu

ency

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1) 50
2) 5
3) 17
4) 30

Stem and leaf displays

Have the following general appearance:

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4

Suppose a 95 yr old is hired

stem | leaf
   1 | 8 9
   2 | 1 2 8 9 9
   3 | 2 3 8 9
   4 | 0 1
   5 | 6 7
   6 | 4
   7 |
   8 |
   9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
   4 | 03
   3 | 247
   2 | 6677789
   2 | 01222233444
   1 | 13467889
   0 | 8

Pulse Rates, n = 138

count  stem | leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1) 4
2) 6
3) 8
4) 10
5) 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
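The arithmetic in the TV-viewing example can be checked with a few lines of Python. This is only a minimal sketch; it assumes the seventh observation is 9, the value that makes the observations add up to the 112 shown in the calculation. The variable names are illustrative.

```python
# Weekly TV viewing times (hours) for the 7 fourth graders.
# Assumption: the seventh value is 9, so that the sum matches the slide's 112.
hours = [19, 40, 16, 12, 10, 6, 9]

n = len(hours)        # sample size n
total = sum(hours)    # sum of all observations
y_bar = total / n     # sample mean = sum / n

print(n, total, y_bar)  # 7 112 16.0
```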

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Figure: histogram of Absences from Work; horizontal axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2 4 6 8 10; n = 5, median = 6
Ex: 2 4 6 8; n = 4, median = (4+6)/2 = 5
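The odd/even rule for the median translates directly into code. The function below is a small sketch of that rule (the name `median` is just an illustrative choice), checked against the two examples above.

```python
def median(values):
    """Median by the slide's rule: middle value if n is odd,
    mean of the 2 middle values if n is even."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # n odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: average the 2 middle values

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```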

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2 4 6 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2 4 6 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$n = 23, \qquad \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation, $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year; left vertical axis: Mean, Median Salary (200,000 to 3,700,000); right vertical axis: Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600); bar labels: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum of the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

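Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$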

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

  i    x_i    x̄     (x_i - x̄)   (x_i - x̄)²
  1    59    63.4     -4.4        19.0
  2    60    63.4     -3.4        11.3
  3    61    63.4     -2.4         5.6
  4    62    63.4     -1.4         1.8
  5    62    63.4     -1.4         1.8
  6    63    63.4     -0.4         0.1
  7    63    63.4     -0.4         0.1
  8    63    63.4     -0.4         0.1
  9    64    63.4      0.6         0.4
 10    64    63.4      0.6         0.4
 11    65    63.4      1.6         2.7
 12    66    63.4      2.6         7.0
 13    67    63.4      3.6        13.3
 14    68    63.4      4.6        21.6
      Mean 63.4      Sum 0.0     Sum 85.2

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
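As one illustration of "other software", the women's-height calculation can be reproduced in Python. This is a sketch, not the course's prescribed tool; it follows the two steps above (variance first, then square root) and cross-checks the result against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Heights (inches) of the 14 women in the worked example
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                          # sample mean
deviations = [x - x_bar for x in heights]         # x_i - x_bar; these sum to 0
sum_sq = sum(d ** 2 for d in deviations)          # sum of squared deviations
variance = sum_sq / (n - 1)                       # step 1: s^2, with n - 1 degrees of freedom
s = math.sqrt(variance)                           # step 2: s

print(round(x_bar, 1), round(sum_sq, 1), round(variance, 2), round(s, 2))
# 63.4 85.2 6.55 2.56
```

The library call `statistics.stdev(heights)` returns the same value of s in one step.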

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34000$ for NCSU); $\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance; σ² is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0. When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$      sample mean
$m$            sample median
$s^2$          sample variance
$s$            sample stand dev

POPULATION
$\mu$          population mean
$m$            population median
$\sigma^2$     population variance
$\sigma$       population stand dev

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \; \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \; \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \; \bar{y} + 3s)$

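68-95-99.7 rule: 68% within 1 stan dev of the mean

[Figure: bell-shaped curve; 68% of the area lies between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan dev of the mean

[Figure: bell-shaped curve; 95% of the area lies between y-2s and y+2s, 47.5% on each side of the mean y]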

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:
$(\bar{y} - s, \; \bar{y} + s) = (332.76, \; 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \; \bar{y} + 2s) = (290.04, \; 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \; \bar{y} + 3s) = (247.32, \; 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
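The three interval counts for the textbook costs can be reproduced programmatically. The sketch below is one way to do it in Python; `share_within` is a hypothetical helper name, not something from the course materials. It computes the mean and sample standard deviation from the 50 values, then counts how many fall inside each k-standard-deviation interval.

```python
import statistics

# The 50 textbook costs from the example
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

y_bar = statistics.mean(costs)   # sample mean
s = statistics.stdev(costs)      # sample standard deviation (divides by n - 1)

def share_within(k):
    """Fraction of the costs strictly inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(1 for c in costs if low < c < high)
    return inside / len(costs)

for k in (1, 2, 3):
    print(k, share_within(k))    # 1 0.64, 2 0.96, 3 1.0
```

The observed shares (64%, 96%, 100%) sit close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.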

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
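Standardizing is a one-line computation, so comparisons like these are easy to script. The sketch below uses an illustrative helper, `z_score`, and applies it to the SAT/ACT comparison and to the two-exam example.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

eleanor = z_score(680, 500, 100)   # SAT Math
gerald = z_score(27, 18, 6)        # ACT Math
print(eleanor, gerald)             # 1.8 1.5

exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)
print(exam1, exam2)                # 0.5 0.4
```

In both comparisons the larger z-score identifies the better relative performance, even when the raw scores are on different scales.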

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar), Sum = 0 (Z-score)

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

[Figure: 25 data values listed in order with the quartiles marked]

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1     M     Q3
 1/4     1/4    1/4    1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
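The quartile rule above (include the overall median in both halves when n is odd, in neither when n is even) is one of several conventions in use, and software packages may apply a different one. The Python sketch below implements the rule exactly as stated here; the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: overall median belongs to both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]       # n even: overall median belongs to neither half
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

Because conventions differ, the quartiles reported by a calculator or spreadsheet can differ slightly from these values for the same data.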

Pulse Rates, n = 138

count  stem | leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1) 287
2) 257.5
3) 263.5
4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1) 23.5
2) 39.5
3) 46
4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

[Figure: Disease X, years until death (vertical axis 0 to 7); the 25 ordered values with the five-number summary marked]

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X, years until death (vertical axis 0 to 8); 25 ordered values, the largest now 7.9]

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses marked at 40.5, 45, 63, 70, 78, 100.5]
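The 1.5 × IQR cutoffs used to flag boxplot outliers are simple to compute. The sketch below uses an illustrative helper, `outlier_fences`, and applies it to the pulse-rate quartiles and to the Disease X quartiles (Q1 = 2.3, Q3 = 4.2).

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

# Beginning-of-class pulses: Q1 = 63, Q3 = 78
low, high = outlier_fences(63, 78)
print(low, high)                 # 40.5 100.5
print(111 > high)                # True: the maximum pulse, 111, is flagged as an outlier

# Disease X: Q1 = 2.3, Q3 = 4.2, largest value 7.9
_, upper = outlier_fences(2.3, 4.2)
print(round(upper, 2))           # 7.05
print(7.9 > upper)               # True: 7.9 years is flagged as an outlier
```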

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive     710/2201     32.3%
Dead     1491/2201     67.7%

marg. dist. of class:
Crew      885/2201     40.2%
First     325/2201     14.8%
Second    285/2201     12.9%
Third     706/2201     32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (counts from the Titanic table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
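The marginal and conditional distributions, and the three fractions asked about, all come from the same table of counts with different denominators. The Python sketch below makes the choice of denominator explicit; the variable names are illustrative and the counts are the Titanic counts from the table.

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

col_totals = [a + d for a, d in zip(alive, dead)]   # total per class
grand_total = sum(col_totals)                        # everyone on board

# Marginal distributions: divide row or column totals by the grand total
marginal_survival = {"Alive": sum(alive) / grand_total,
                     "Dead": sum(dead) / grand_total}
marginal_class = {c: t / grand_total for c, t in zip(classes, col_totals)}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: a / t for c, a, t in zip(classes, alive, col_totals)}

first = classes.index("First")
survivors_in_first = alive[first] / sum(alive)         # 202/710
first_and_survived = alive[first] / grand_total        # 202/2201
first_who_survived = alive[first] / col_totals[first]  # 202/325

print(round(100 * marginal_survival["Alive"], 1))       # 32.3
print(round(100 * survival_given_class["First"], 1))    # 62.2
```

The same count, 202, answers three different questions depending on whether it is divided by the number of survivors, the grand total, or the first-class total.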

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 80
2) 235
3) 582
4) 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 418
2) 388
3) 512
4) 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 452
2) 488
3) 268
4) 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students listed above]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

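Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$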
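The formula for r can be evaluated directly: standardize each x and each y, multiply the pairs of z-scores, add them up, and divide by n - 1. The Python sketch below does this for the ten (weight, fuel consumption) pairs; it assumes the decimal readings of the data (for example, 3.4 thousand lbs and 5.5 gal/100 miles), which are the values consistent with the plotted axis ranges. The function name `correlation` is illustrative.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(x, y):
    """r = 1/(n-1) * sum of (z-score of x_i) * (z-score of y_i)."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)   # sample standard deviations
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))   # 0.9766
```

Swapping the roles of x and y gives the same r, since the formula treats the two variables symmetrically.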

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3
Page 26: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Excel Example 2012-13 NFL Salaries

3694

80

1273

609

231

2177

738

462

3081

867

692

3985

996

923

4890

126

154

5794

255

385

6698

384

615

7602

513

846

8506

643

077

9410

772

308

1031

4901

54

1121

9030

77

1212

3160

1302

7289

23

1393

1418

46

1483

5547

69

1573

9676

92

1664

3806

15

1754

7935

38

0

100

200

300

400

500

600

700

800

900

1000

Histogram

Bin

Fre

qu

ency

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100), vertical axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8, written 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hired

stem leaf

1 8 9
2 1 2 8 9 9
3 2 3 8 9
4 0 1
5 6 7
6 4
7
8
9 5
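A stem-and-leaf display like the ones above can be built programmatically. The sketch below assumes two-digit whole-number data with the 10's digit as stem and the 1's digit as leaf; the function name `stem_and_leaf` is hypothetical. Empty stems are kept, which is what makes the gap before the 95-year-old visible.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the display as a list of 'stem | leaves' strings.

    Assumes non-negative integers; stem = value // 10, leaf = value % 10.
    Every stem between the smallest and largest is printed, even if empty.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        leaves_by_stem[v // 10].append(v % 10)
    lines = []
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(d) for d in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)
display_with_95 = stem_and_leaf(ages + [95])
print("\n".join(display_with_95))
```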

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 0 3
3 | 2 4 7
2 | 6 6 7 7 7 8 9
2 | 0 1 2 2 2 2 3 3 4 4 4
1 | 1 3 4 6 7 8 8 9
0 | 8

Pulse Rates n = 138

Count  Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
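The same arithmetic can be checked in a few lines of Python. This is only a sketch of the sum-then-divide definition, using the seven viewing times whose total on the slide is 112; the variable names are illustrative.

```python
# Weekly TV viewing times (hours) for the 7 fourth graders.
hours = [19, 40, 16, 12, 10, 6, 9]

total = sum(hours)   # sum of the observations
n = len(hours)       # sample size
y_bar = total / n    # sample mean = sum / n

print(total, n, y_bar)
```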

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N$ = 34000 for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{(population mean)}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram titled "Absences from Work"; bins from 118.5 to 160.5 plus "More"; vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
       = mean of the 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
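The odd/even rule translates directly into code. The function below is a minimal sketch of that rule (Python's standard library also offers `statistics.median`, which follows the same definition); the name `median` here is just a local helper.

```python
def median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # n odd: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even

print(median([2, 4, 6, 8, 10]))
print(median([2, 4, 6, 8]))
```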

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = (n+1)/2 = 12th obs; m = 85

2010, 2014 baseball salaries

2010: n = 845, mean = $3,297,828, median = $1,330,000, max = $33,000,000

2014: n = 848, mean = $3,932,912, median = $1,456,250, max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
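To see this sensitivity concretely, the sketch below recomputes the mean and median of the 23 class pulse rates with and without the largest value (140). Dropping the observation is purely illustrative and is not something the slides do.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23), already sorted.
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

# Illustrative assumption: remove the single largest observation.
without_max = pulses[:-1]

print("with 140:   ", round(mean(pulses), 2), median(pulses))
print("without 140:", round(mean(without_max), 2), median(without_max))
```

One extreme value moves the mean by about 2.5 beats per minute, while the median shifts by only half a beat.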

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: time plot titled "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis: Year (1985 to 2013); left axis: Mean, Median Salary ($200,000 to $3,700,000); right axis: Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram titled "2011 Baseball Salaries"; horizontal axis: Salary ($1000s), vertical axis: Frequency (0 to 600)]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100), vertical axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram titled "Bank Customers 10:00-11:00 am"; horizontal axis: Number of Customers, vertical axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest: OK sometimes; in general too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

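Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$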

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

 i    x_i    x̄      (x_i - x̄)   (x_i - x̄)^2
 1    59    63.4     -4.4        19.0
 2    60    63.4     -3.4        11.3
 3    61    63.4     -2.4         5.6
 4    62    63.4     -1.4         1.8
 5    62    63.4     -1.4         1.8
 6    63    63.4     -0.4         0.1
 7    63    63.4     -0.4         0.1
 8    63    63.4     -0.4         0.1
 9    64    63.4      0.6         0.4
10    64    63.4      0.6         0.4
11    65    63.4      1.6         2.7
12    66    63.4      2.6         7.0
13    67    63.4      3.6        13.3
14    68    63.4      4.6        21.6
      Mean 63.4      Sum 0.0     Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 sd

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
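As one possible software route, the sketch below repeats the women's height calculation in Python, first step by step (deviations, squared deviations, divide by n - 1, square root) and then with the standard library's `statistics.stdev`, which also divides by n - 1. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Heights (inches) of the 14 women in the worked table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                        # sample mean
squared_devs = [(x - x_bar) ** 2 for x in heights]
variance = sum(squared_devs) / (n - 1)          # s^2, with n - 1 = 13 df
s = sqrt(variance)                              # s

print(round(x_bar, 1), round(sum(squared_devs), 1),
      round(variance, 2), round(s, 2))
print(round(stdev(heights), 2))                 # library result for comparison
```

The hand-style computation uses the unrounded mean, which is why the squared deviations sum to 85.2 rather than the value obtained from the rounded 63.4.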

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34000 for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION
$\mu$        population mean
$m$          population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
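The three interval counts for the textbook costs can be verified with a short script. This is a sketch rather than the slides' method: the helper name `count_within` is hypothetical, and it counts values strictly inside the interval (no data value falls exactly on an endpoint here).

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328,
         340, 342, 346, 347, 348, 348, 349, 354,
         355, 355, 360, 361, 364, 367, 369, 371,
         373, 377, 380, 381, 382, 385, 385, 387,
         390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437,
         440, 480]

y_bar = mean(costs)
s = stdev(costs)          # sample standard deviation (divides by n - 1)

def count_within(k):
    """Number of costs inside (y_bar - k*s, y_bar + k*s)."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(1 for c in costs if lo < c < hi)

for k in (1, 2, 3):
    inside = count_within(k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```

The observed shares (64%, 96%, 100%) sit close to the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.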

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better
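Standardizing is a one-line calculation, so comparisons like the exam and SAT/ACT examples are easy to script. The function name `z_score` below is an illustrative choice, not something defined in the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam comparison: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT Math vs ACT Math comparison.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)
print(z_eleanor, z_gerald)
```

The larger z-score marks the better relative performance, even when the raw score is smaller or on a different scale.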

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland       15.5        6.4        1.79
UVA            13.1        4.0        1.12
Louisville     10.9        1.8        0.50
UNC             9.2        0.1        0.03
VaTech          7.9       -1.2       -0.34
FSU             7.9       -1.2       -0.34
GaTech          7.1       -2.0       -0.56
NCSU            6.5       -2.6       -0.73
Clemson         3.8       -5.3       -1.47
                          Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697
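The zero sum is not special to this table: deviations from the mean always cancel, so z-scores do too. The sketch below checks this on the support figures, assuming the values are in $ millions with one decimal place (15.5, 13.1, and so on); z-scores computed from these rounded inputs can differ from the table in the last digit.

```python
from statistics import mean, stdev

# Support in $ millions (decimal placement assumed from the table).
support = {"Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
           "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
           "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8}

y_bar = mean(support.values())
s = stdev(support.values())

# Standardize every school's support figure.
z = {school: (y - y_bar) / s for school, y in support.items()}

print(round(y_bar, 4), round(s, 4))
print(round(sum(z.values()), 10))   # zero up to floating-point rounding
```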

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data in order of position 1 to 25 (n = 25):

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 is the value in position 7, m the value in position 13, and Q3 the value in position 19.

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1    M    Q3
 1/4     1/4   1/4    1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
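The quartile rules above can be written as a small function. This sketch implements exactly that convention (overall median included in both halves when n is odd, excluded when n is even); library routines such as `statistics.quantiles` use different interpolation conventions and may give slightly different answers. The names `median` and `quartiles` are local helpers.

```python
def median(values):
    """Middle value (n odd) or mean of the 2 middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the slide's median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: overall median in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]       # n even: data splits cleanly in two
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```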

Pulse Rates n = 138

Count  Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Data from position 25 down to position 1:

6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 7); the five-number summary (min, Q1, m, Q3, max) is marked on the box and whiskers]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data from position 25 down to position 1:

7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 8); display of 5-number summary with one outlier]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
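The 1.5 × IQR check can be scripted as follows. This is a sketch under two assumptions: the data values carry one decimal place (0.6 up to 7.9 years, which is what makes Q3 + 1.5 IQR come out to 7.05), and quartiles follow the median-of-halves rule with the overall median included in both halves for odd n. The helper names are hypothetical.

```python
def median(values):
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def q1_q3(values):
    """Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2:                      # n odd: share the median
        return median(ordered[:half + 1]), median(ordered[half:])
    return median(ordered[:half]), median(ordered[half:])

# Years until death, with the largest value changed to 7.9 (decimals assumed).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

q1, q3 = q1_q3(years)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [y for y in years if y < lower_fence or y > upper_fence]
# The upper whisker ends at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(q1, q3, round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```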

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: box plot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212    202     118      178      710
Dead      673    123     167      528     1491
Total     885    325     285      706     2201

Marginal distributions

marg. dist. of survival:
Alive   710/2201 = 32.3%
Dead   1491/2201 = 67.7%

marg. dist. of class:
Crew    885/2201 = 40.2%
First   325/2201 = 14.8%
Second  285/2201 = 12.9%
Third   706/2201 = 32.1%
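The marginal distributions come from the row and column totals divided by the grand total. The sketch below computes them from the cell counts; the nested-dictionary layout and variable names are illustrative choices.

```python
# Titanic counts: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row total / grand total.
survival_marginal = {status: sum(row.values()) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of class: column total / grand total.
class_marginal = {cls: sum(counts[status][cls] for status in counts) / grand_total
                  for cls in counts["Alive"]}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```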

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212     202     118      178      710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528     1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706     2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the survival-by-class table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
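The three answers share the numerator 202 (first-class survivors) and differ only in the denominator, which is what distinguishes a conditional from a joint proportion. A minimal sketch with illustrative variable names:

```python
# Counts taken from the Titanic table.
first_class_survivors = 202
all_survivors = 710          # row total: Alive
all_passengers = 2201        # grand total
first_class_total = 325      # column total: First

# Conditional on surviving: share of survivors who were in first class.
frac_of_survivors = first_class_survivors / all_survivors

# Joint: share of everyone aboard who was both first class and a survivor.
frac_of_everyone = first_class_survivors / all_passengers

# Conditional on class: share of first-class passengers who survived.
frac_of_first_class = first_class_survivors / first_class_total

print(f"{100 * frac_of_survivors:.1f}%",
      f"{100 * frac_of_everyone:.1f}%",
      f"{100 * frac_of_first_class:.1f}%")
```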

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot titled "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
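The formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot example. This is a sketch that assumes each value carries one decimal place (weights in 1000 lbs, fuel in gal/100 miles); with that reading the result agrees with the r = .9766 reported for these data. The function name `correlation` is illustrative.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles); decimals assumed.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Because each term is a product of two z-scores, r has no units and does not change if weight is recorded in pounds or fuel in liters.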

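Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot titled "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$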

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire (more firemen, more damage).

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis: Fouls (0 to 30), vertical axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 27: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Statcrunch Example 2012-13 NFL Salaries

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4
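The display above can be built mechanically: sort the data, split each value into a tens digit (stem) and a ones digit (leaf), and list the leaves for each stem. The sketch below assumes two-digit whole-number data; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Employee ages from the example slide.
ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]


def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with stem = tens digit, leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Keep empty stems so that gaps in the data remain visible.
    return {stem: leaves.get(stem, [])
            for stem in range(min(leaves), max(leaves) + 1)}


display = stem_and_leaf(ages)
for stem, row in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in row))
```

Calling `stem_and_leaf(ages + [95])` adds empty rows for stems 7 and 8, which is why a single extreme value stands out in this kind of display.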

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem leaf

4 03

3 247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
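The same arithmetic in code, as a minimal sketch (the function name `sample_mean` is only illustrative): add the observations and divide by how many there are.

```python
# Weekly TV viewing time (hours) for the 7 fourth graders in the example.
hours = [19, 40, 16, 12, 10, 6, 9]


def sample_mean(values):
    """Sample mean: the sum of the observations divided by the sample size n."""
    return sum(values) / len(values)


y_bar = sample_mean(hours)
print(sum(hours), len(hours), y_bar)
```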

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{(population mean)}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram "Absences from Work"; horizontal axis: bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)

Median = mean of the 2 middle values (n even)

Ex. 2, 4, 6, 8, 10: n = 5, median = 6

Ex. 2, 4, 6, 8: n = 4, median = (4+6)/2 = 5
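The definition translates directly into code. A minimal sketch, with a hypothetical function name `median`; note that the data must be sorted first, which the function does itself.

```python
def median(values):
    """Middle value when n is odd; mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6, 8, 10]))  # n odd
print(median([2, 4, 6, 8]))      # n even
```

Python's standard library offers the same calculation as `statistics.median`.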

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location 12th obs., $m = 85$

2010, 2014 baseball salaries

2010: n = 845, mean = $3,297,828, median = $1,330,000, max = $33,000,000

2014: n = 848, mean = $3,932,912, median = $1,456,250, max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median, and Maximum, 1985-2014"; horizontal axis: Year (1985 to 2013); left axis: Mean, Median Salary ($200,000 to $3,700,000); right axis: Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600); bar heights, left to right: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

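Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$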

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

 i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59    63.4       -4.4            19.0
 2    60    63.4       -3.4            11.3
 3    61    63.4       -2.4             5.6
 4    62    63.4       -1.4             1.8
 5    62    63.4       -1.4             1.8
 6    63    63.4       -0.4             0.1
 7    63    63.4       -0.4             0.1
 8    63    63.4       -0.4             0.1
 9    64    63.4        0.6             0.4
10    64    63.4        0.6             0.4
11    65    63.4        1.6             2.7
12    66    63.4        2.6             7.0
13    67    63.4        3.6            13.3
14    68    63.4        4.6            21.6

Mean 63.4      Sum 0.0      Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
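As one illustration of "other software", here is a minimal Python sketch that repeats the women's height calculation. The function name `sample_variance` is made up for this example; the divisor is n - 1, matching the sample formulas on the preceding slides.

```python
import math

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


variance = sample_variance(heights)   # square inches
s = math.sqrt(variance)               # inches
print(round(variance, 2), round(s, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and should agree with these values.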

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION
$\mu$        population mean
$m$          population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

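68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]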

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
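The three interval checks can be automated. The sketch below is an illustration under stated assumptions: the helper name `share_within` is hypothetical, and the mean and sample standard deviation are recomputed from the 50 textbook costs with Python's `statistics` module rather than typed in.

```python
import statistics

# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (n - 1 divisor)


def share_within(values, center, spread, k):
    """Fraction of values inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(1 for v in values if lo <= v <= hi) / len(values)


shares = [share_within(costs, y_bar, s, k) for k in (1, 2, 3)]
for k, share in zip((1, 2, 3), shares):
    print(f"within {k} s.d.: {share:.0%}")
```

Comparing these shares with 68%, 95%, and 99.7% is a quick way to judge how closely a data set follows the rule.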

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
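A z-score is a one-line calculation, so the two comparisons above are easy to check in code. This is only a sketch; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


eleanor = z_score(680, 500, 100)   # SAT Math
gerald = z_score(27, 18, 6)        # ACT Math
exam_1 = z_score(91, 88, 6)
exam_2 = z_score(92, 88, 10)

print(eleanor, gerald)
print(exam_1, exam_2)
```

Because z-scores are unit-free, scores from different scales (SAT vs ACT, exam 1 vs exam 2) can be compared directly.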

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Individual   Position in half   Years until death
     1              1                 0.6
     2              2                 1.2
     3              3                 1.6
     4              4                 1.9
     5              5                 1.5
     6              6                 2.1
     7              7                 2.3
     8              6                 2.3
     9              5                 2.5
    10              4                 2.8
    11              3                 2.9
    12              2                 3.3
    13              1                 3.4
    14              2                 3.6
    15              3                 3.7
    16              4                 3.8
    17              5                 3.9
    18              6                 4.1
    19              7                 4.2
    20              6                 4.5
    21              5                 4.7
    22              4                 4.9
    23              3                 5.3
    24              2                 5.6
    25              1                 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1      M      Q3
 1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
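The quartile rules stated on the slides (include the overall median in both halves when n is odd, in neither half when n is even) can be sketched in code as follows. The function names are hypothetical, and other software may use a different quartile convention, so its answers can differ slightly from this rule.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    # n even: each half has n/2 values and excludes the overall median.
    # n odd: each half has (n + 1)/2 values and includes the overall median.
    half = (n + 1) // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```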

Pulse Rates n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Individual   Position in half   Years until death
    25              1                 6.1
    24              2                 5.6
    23              3                 5.3
    22              4                 4.9
    21              5                 4.7
    20              6                 4.5
    19              7                 4.2
    18              6                 4.1
    17              5                 3.9
    16              4                 3.8
    15              3                 3.7
    14              2                 3.6
    13              1                 3.4
    12              2                 3.3
    11              3                 2.9
    10              4                 2.8
     9              5                 2.5
     8              6                 2.3
     7              7                 2.3
     6              6                 2.1
     5              5                 1.5
     4              4                 1.9
     3              3                 1.6
     2              2                 1.2
     1              1                 0.6

Largest = max = 6.1

Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot "Disease X"; vertical axis: Years until death (0 to 7)]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Individual   Position in half   Years until death
    25              1                 7.9
    24              2                 6.1
    23              3                 5.3
    22              4                 4.9
    21              5                 4.7
    20              6                 4.5
    19              7                 4.2
    18              6                 4.1
    17              5                 3.9
    16              4                 3.8
    15              3                 3.7
    14              2                 3.6
    13              1                 3.4
    12              2                 3.3
    11              3                 2.9
    10              4                 2.8
     9              5                 2.5
     8              6                 2.3
     7              7                 2.3
     6              6                 2.1
     5              5                 1.5
     4              4                 1.9
     3              3                 1.6
     2              2                 1.2
     1              1                 0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot "Disease X"; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, marked at 40.5, 45, 63, 70, 78, and 100.5]
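The 1.5 × IQR calculation above can be written as a small helper. This is a sketch with a hypothetical function name, `outlier_fences`; values below the lower fence or above the upper fence are flagged as outliers.

```python
def outlier_fences(q1, q3):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


low, high = outlier_fences(63, 78)   # beginning-of-class pulse rates
print(low, high)

# Check the smallest and largest pulses from the 5-number summary.
extremes = [45, 111]
outliers = [v for v in extremes if v < low or v > high]
print(outliers)
```

With these fences the minimum pulse (45) is not an outlier, while the maximum (111) is.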

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212    202     118      178     710
Dead      673    123     167      528    1491
Total     885    325     285      706    2201

Marginal distributions

marg. dist. of survival
Alive   710/2201    32.3%
Dead    1491/2201   67.7%

marg. dist. of class
Crew     885/2201   40.2%
First    325/2201   14.8%
Second   285/2201   12.9%
Third    706/2201   32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212     202     118      178     710
        % of col.     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count          673     123     167      528    1491
        % of col.     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count          885     325     285      706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                        Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212     202     118      178     710
        % of col.     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count          673     123     167      528    1491
        % of col.     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count          885     325     285      706    2201
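The marginal and conditional distributions in the Titanic tables all come from the same four-by-two table of counts, divided by different totals. A minimal sketch using plain dictionaries (the variable names are illustrative, not from the slides):

```python
# Counts from the Titanic contingency table.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distributions: row totals and column totals over the grand total.
marginal_survival = {k: sum(row.values()) / total for k, row in counts.items()}
class_totals = {c: sum(row[c] for row in counts.values()) for c in counts["Alive"]}
marginal_class = {c: t / total for c, t in class_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in class_totals}

# The three questions about first class use three different denominators.
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())
first_and_survived = counts["Alive"]["First"] / total
survived_among_first = counts["Alive"]["First"] / class_totals["First"]

print(f"{marginal_survival['Alive']:.1%}", f"{survival_given_class['First']:.1%}")
```

The denominator is what distinguishes the three questions: all survivors (710), all people on board (2201), or all first-class passengers (325).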

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

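Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$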
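The correlation formula can be checked against the fuel consumption example. The sketch below assumes the ten data pairs are in the units shown on the plot axes (weight in 1000 lbs, fuel consumption in gal/100 miles, so 3.4 and 5.5 rather than 34 and 55); the function name `correlation` is illustrative.

```python
import math

# Car weight (1000 lbs) and fuel consumption (gal/100 miles), assumed units.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = ((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.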

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 28: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Heights of Students in Recent Stats Class (Bimodal)

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hiredstem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams 2012-2013 season(stems are 10rsquos digit)

stem leaf

43

03247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Stem Leaves 4 3 4 588 9 5 001233444 10 5 5556788899 23 6 00011111122233333344444 23 6 55556666667777788888888 16 7 00000112222334444 23 7 55555666666777888888999 10 8 0000112224 10 8 5555667789 4 9 0012 2 9 58 4 10 0223 10 1 11 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement is displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with populations between 100,000 and 500,000

[Figure: stem-and-leaf display; multiply stems by 100,000]

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits.

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.).

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure:

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9.7

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{19 + 40 + 16 + 12 + 10 + 6 + 9.7}{7} = \frac{112.7}{7} = 16.1$$

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; horizontal axis class boundaries 118.5 to 160.5 and "More", vertical axis Frequency (0 to 70)]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value (n odd)
       = mean of the 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
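The odd/even rule for the median translates directly into code. This is a small Python sketch (the function name `median` is our own; Python's standard library also provides `statistics.median`), applied to the two small examples and to the 62 student pulse rates listed above.

```python
def median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Student pulse rates from the slide (n = 62, already in order)
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(median([2, 4, 6, 8, 10]))  # n odd
print(median([2, 4, 6, 8]))      # n even
print(median(pulses))            # mean of the 31st and 32nd ordered values
```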

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

[Figure: histogram with the mean (55.26 years) and the median (57.7 years) marked]

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458
Example, n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458
Example, n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$
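To see how differently the mean and median react to a single extreme value, the class pulse rates can be summarized with and without the largest observation (140). This is an illustrative Python sketch using the standard `statistics` module; dropping the value 140 is our own what-if, not a step from the slides.

```python
from statistics import mean, median

class_pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
                90, 90, 90, 90, 91, 96, 98, 103, 140]

# Summaries for all 23 observations
full_mean = mean(class_pulses)
full_median = median(class_pulses)

# What-if: remove the largest observation (140) and summarize again
trimmed = sorted(class_pulses)[:-1]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print(f"all 23 values: mean = {full_mean:.2f}, median = {full_median}")
print(f"without 140:   mean = {trimmed_mean:.2f}, median = {trimmed_median}")
```

Removing one value moves the mean by about 2.5 beats but moves the median by only half a beat, which is the sense in which the median is less sensitive to extreme observations.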

2010, 2014 baseball salaries

           2010           2014
n          845            848
mean       $3,297,828     $3,932,912
median     $1,330,000     $1,456,250
max        $33,000,000    $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis Year; one vertical axis Mean, Median Salary (200,000 to 3,700,000); a second vertical axis Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s), vertical axis Frequency (0 to 600)]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores (20 to 100), vertical axis Frequency (0 to 30)]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis Number of Customers, vertical axis Frequency (0 to 20)]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general too crude, and sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:
   $y_i - \bar{y}$ is the deviation of $y_i$ from the mean;
   $\sum_{i=1}^{n} (y_i - \bar{y})$ sums the deviations of all the $y_i$'s from $\bar{y}$.

   But $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always, so this sum tells us nothing.

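Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$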

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women's height (inches)

  i    x_i    mean    (x_i - mean)    (x_i - mean)^2
  1    59     63.4       -4.4             19.0
  2    60     63.4       -3.4             11.3
  3    61     63.4       -2.4              5.6
  4    62     63.4       -1.4              1.8
  5    62     63.4       -1.4              1.8
  6    63     63.4       -0.4              0.1
  7    63     63.4       -0.4              0.1
  8    63     63.4       -0.4              0.1
  9    64     63.4        0.6              0.4
 10    64     63.4        0.6              0.4
 11    65     63.4        1.6              2.7
 12    66     63.4        2.6              7.0
 13    67     63.4        3.6             13.3
 14    68     63.4        4.6             21.6
       Mean 63.4      Sum 0.0         Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
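As one example of "other software", the women's height calculation can be checked in Python. This is a sketch rather than a prescribed tool: it first follows the two steps by hand (variance, then square root) and then compares against `statistics.variance` and `statistics.stdev`, which both use the n - 1 divisor.

```python
from math import sqrt
from statistics import mean, stdev, variance

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
x_bar = mean(heights)

# Deviations from the mean always sum to zero (up to rounding)
deviations = [x - x_bar for x in heights]
sum_dev = sum(deviations)

# Step 1: variance = sum of squared deviations divided by (n - 1)
sum_sq = sum(d ** 2 for d in deviations)
s2 = sum_sq / (n - 1)

# Step 2: standard deviation = square root of the variance
s = sqrt(s2)

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"s^2 = {s2:.2f} square inches, s = {s:.2f} inches")
print(f"library check: variance = {variance(heights):.2f}, stdev = {stdev(heights):.2f}")
```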

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU); $\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$     sample mean
$m$           sample median
$s^2$         sample variance
$s$           sample stand. dev.

POPULATION
$\mu$         population mean
              population median
$\sigma^2$    population variance
$\sigma$      population stand. dev.

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) = 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$;

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$;

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$.

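68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between y-2s and y+2s, 47.5% on each side of the mean y]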

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$ (same 50 textbook costs as on the preceding slide)

1 standard deviation interval about the mean:
$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
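The three interval counts for the textbook costs can be verified with a few lines of code. The Python sketch below computes the mean and standard deviation from the 50 listed costs and counts how many values fall strictly inside each interval; the helper name `count_within` is an illustrative assumption.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)  # sample standard deviation (n - 1 divisor)

def count_within(values, center, spread, k):
    """Number of values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(low < v < high for v in values)

counts = {k: count_within(costs, y_bar, s, k) for k in (1, 2, 3)}
for k, expected in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    pct = 100 * counts[k] / len(costs)
    print(f"within {k} s.d.: {counts[k]}/50 = {pct:.0f}%  (rule: about {expected})")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 values; the rule is an approximation for roughly bell-shaped data.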

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to $y$:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$, your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$, your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
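The z-score comparisons above are simple enough to script. Here is a minimal Python sketch; the function name `z_score` is our own choice, and the means and standard deviations are the ones stated on the slides.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Two exams with the same mean but different spread
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT Math vs ACT Math
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(f"exam 1: z = {z_exam1}, exam 2: z = {z_exam2}")
print(f"Eleanor (SAT): z = {z_eleanor}, Gerald (ACT): z = {z_gerald}")
```

Because z-scores are unit-free, scores from different scales (SAT and ACT) can be compared directly: the larger z-score is the better relative performance.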

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland       15.5        6.4         1.79
UVA            13.1        4.0         1.12
Louisville     10.9        1.8         0.50
UNC             9.2        0.1         0.03
VaTech          7.9       -1.2        -0.34
FSU             7.9       -1.2        -0.34
GaTech          7.1       -2.0        -0.56
NCSU            6.5       -2.6        -0.73
Clemson         3.8       -5.3        -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar column); Sum = 0 (Z-score column)

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Individual   Rank from nearest end   Value
    1                 1               0.6
    2                 2               1.2
    3                 3               1.6
    4                 4               1.9
    5                 5               1.5
    6                 6               2.1
    7                 7               2.3
    8                 6               2.3
    9                 5               2.5
   10                 4               2.8
   11                 3               2.9
   12                 2               3.3
   13                 1               3.4
   14                 2               3.6
   15                 3               3.7
   16                 4               3.8
   17                 5               3.9
   18                 6               4.1
   19                 7               4.2
   20                 6               4.5
   21                 5               4.7
   22                 4               4.9
   23                 3               5.3
   24                 2               5.6
   25                 1               6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

        Q1        M        Q3
  1/4       1/4       1/4       1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
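The quartile rules above can be written out as a short function. This Python sketch implements exactly the convention stated on the slide (include the overall median in both halves when n is odd, in neither half when n is even); the function names are our own. Statistical software often uses other interpolation conventions, so library results may differ slightly from this rule.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the halves do not overlap
        lower, upper = ordered[:half], ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)

# An odd-n illustration (our own): halves are 2 4 6 and 6 8 10
print(quartiles([2, 4, 6, 8, 10]))
```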

Pulse Rates, n = 138

[Stem-and-leaf display of the 138 pulse rates, repeated from the earlier Pulse Rates slide]

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Example: Disease X, years until death (25 individuals)

Smallest = min = 0.6
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Largest = max = 6.1

[Figure: Disease X values plotted against a vertical axis, Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary (Disease X, with the value for individual 25 now 7.9 instead of 6.1)

Q1 = first quartile = 2.3
Q3 = third quartile = 4.2
Largest = max = 7.9

[Figure: BOXPLOT of Disease X; vertical axis Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with 40.5, 45, 63, 70, 78 and 100.5 marked]
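The 1.5 × IQR outlier check used in the last two examples can be sketched in Python as follows. The function name `outlier_fences` and its parameter `k` are illustrative assumptions; the quartiles are the ones given on the slides.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78
pulse_low, pulse_high = outlier_fences(63, 78)
print(pulse_low, pulse_high)

# Disease X: Q1 = 2.3, Q3 = 4.2; is a value of 7.9 years an outlier?
dx_low, dx_high = outlier_fences(2.3, 4.2)
print(f"upper fence = {dx_high:.2f}; 7.9 is an outlier: {7.9 > dx_high}")
```

Values outside the fences are plotted individually, and each whisker is drawn only to the most extreme data value that is still inside its fence.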

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew   First   Second   Third   Total
Alive      212     202      118     178     710
Dead       673     123      167     528    1491
Total      885     325      285     706    2201

Marginal distributions

marginal distribution of survival:
Alive   710/2201 = 32.3%
Dead   1491/2201 = 67.7%

marginal distribution of class:
Crew    885/2201 = 40.2%
First   325/2201 = 14.8%
Second  285/2201 = 12.9%
Third   706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                      Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the same table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
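The marginal and conditional distributions for the Titanic table can be computed directly from the counts. The Python sketch below stores the table as a nested dictionary (our own layout, chosen for illustration) and reproduces the percentages and the three fractions from the questions.

```python
# Titanic counts: survival status by class (from the contingency table)
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
marginal_survival = {status: sum(row.values()) / total for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
marginal_class = {c: class_totals[c] / total for c in classes}

# Conditional distribution of survival given class: cell / column total
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

print({k: round(100 * v, 1) for k, v in marginal_survival.items()})
print({k: round(100 * v, 1) for k, v in marginal_class.items()})
print({k: round(100 * v, 1) for k, v in survival_given_class.items()})

# The three questions: same count (202), three different denominators
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())  # 202/710
first_and_survived = counts["Alive"]["First"] / total                             # 202/2201
survived_among_first = counts["Alive"]["First"] / class_totals["First"]           # 202/325
```

The three questions differ only in the denominator, which is what distinguishes a conditional distribution (condition on survivors or on class) from a joint proportion (out of all 2201 people).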

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[The Student / Beers / BAC table for the 16 students is repeated on this slide.]

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766
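The value r = .9766 for the fuel consumption data can be reproduced from the formula for r. The Python sketch below standardizes each x and y with the sample mean and sample standard deviation, multiplies the paired z-scores, and divides the sum by n - 1; the function name `correlation` is our own, and the (weight, fuel) pairs are read from the scatterplot slide with decimal points restored.

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
cars = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]

def correlation(pairs):
    """r = (1/(n-1)) * sum of products of paired z-scores."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs]
    return sum(products) / (n - 1)

r = correlation(cars)
print(f"r = {r:.4f}")
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in pounds instead of thousands of pounds.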

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!!!

Everyone who ate carrots in 1865 is now dead!!!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!!!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):
(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis Fouls (0 to 30), vertical axis Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 29: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

ExampleGrades on a statistics exam

Data

75 66 77 66 64 73 91 65 59 86 61 86 61

58 70 77 80 58 94 78 62 79 83 54 52 45

82 48 67 55

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hiredstem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams 2012-2013 season(stems are 10rsquos digit)

stem leaf

43

03247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Stem Leaves 4 3 4 588 9 5 001233444 10 5 5556788899 23 6 00011111122233333344444 23 6 55556666667777788888888 16 7 00000112222334444 23 7 55555666666777888888999 10 8 0000112224 10 8 5555667789 4 9 0012 2 9 58 4 10 0223 10 1 11 1

AdvantagesDisadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large) Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays TD passes by NFL teams 1999-2000 2012-13multiply stems by 10

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order; time on horizontal axis, variable on vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean
Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by \( n \).

For a variable denoted by \( y \), its observations are denoted by \( y_1, y_2, y_3, \ldots, y_n \).

A common measure of center is the sample mean.

The sample mean is denoted by \( \bar{y} \):

\[ \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} \]

Shortened expression for \( \bar{y} \) using the symbol \( \Sigma \) (uppercase Greek letter sigma):

\[ \sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

\[ \sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112 \]

\[ \bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16 \]

Population Mean

Denoted by the Greek letter \( \mu \):

\[ \mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{(population mean)} \]

\( N \) is the population size (for example, \( N = 34{,}000 \) for NCSU)

the value of \( \mu \) is typically not known

we often use the sample mean \( \bar{y} \) to estimate the unknown value of \( \mu \)

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean \( \bar{x} = 140.6 \)

(Figure: histogram of Absences from Work; vertical axis Frequency, 0 to 70; bins labeled 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More)

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
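The two-case definition translates directly into code. The following is a minimal sketch (the function name `median` is chosen here for illustration; Python's standard library also provides `statistics.median`, which applies the same rule).

```python
def median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)        # arrange in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))   # n = 5, odd
print(median([2, 4, 6, 8]))       # n = 4, even
```

Sorting inside the function means the data can be passed in any order.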

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point
Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458
Example, n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458
Example, n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: \( \bar{x} = \frac{20}{4} = 5 \), \( m = \frac{4+6}{2} = 5 \)

Ex: 2, 4, 6, 9: \( \bar{x} = \frac{21}{4} = 5\tfrac{1}{4} \), \( m = \frac{4+6}{2} = 5 \)

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

\( n = 23 \)

\[ \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48 \]

m: location is the 12th obs.; \( m = 85 \)
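Property 2 (the mean uses every value, the median does not) can be seen numerically with the class pulse rates. The sketch below uses Python's standard `statistics` module; dropping the largest pulse (140) is an illustration added here, not part of the original example.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

print(round(mean(pulses), 2), median(pulses))      # all 23 observations

# Remove the single largest value and recompute both measures of center.
trimmed = sorted(pulses)[:-1]
print(round(mean(trimmed), 2), median(trimmed))    # 22 observations
```

The one extreme pulse moves the mean by about 2.5 beats, while the median moves by only half a beat.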

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

(Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis Year; two vertical axes: Mean/Median Salary, $200,000 to $3,700,000, and Maximum Salary, $0 to $35,000,000)

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

(Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1)

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

(Figure: Histogram of Exam Scores; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30)

Symmetric data

mean, median approx. equal

(Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20)

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean \( \bar{y} \)
   \( y_i - \bar{y} \): deviation of \( y_i \) from the mean
   \( \sum_{i=1}^{n} (y_i - \bar{y}) \): sum the deviations of all the \( y_i \)'s from \( \bar{y} \)

\[ \sum_{i=1}^{n} (y_i - \bar{y}) = 0 \ \text{always; tells us nothing} \]

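Example: sum of deviations from mean

Data set 1: \( x_1 = 49 \), \( x_2 = 51 \), \( \bar{x} = 50 \)

\[ (x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0 \]

Data set 2: \( y_1 = 0 \), \( y_2 = 100 \), \( \bar{y} = 50 \)

\[ (y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0 \]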

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)} \]

\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance} \]

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women height (inches)

 i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1   59    63.4      -4.4            19.0
 2   60    63.4      -3.4            11.3
 3   61    63.4      -2.4             5.6
 4   62    63.4      -1.4             1.8
 5   62    63.4      -1.4             1.8
 6   63    63.4      -0.4             0.1
 7   63    63.4      -0.4             0.1
 8   63    63.4      -0.4             0.1
 9   64    63.4       0.6             0.4
10   64    63.4       0.6             0.4
11   65    63.4       1.6             2.7
12   66    63.4       2.6             7.0
13   67    63.4       3.6            13.3
14   68    63.4       4.6            21.6

Mean xbar = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2


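1. First calculate the variance \( s^2 \):

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

2. Then take the square root to get the standard deviation \( s \):

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]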

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
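As one possible "other software" route, the women's height calculation can be reproduced in Python. This is a sketch, not a prescribed tool: it walks through the deviations step by step and then compares the result with `statistics.stdev`, which uses the same n - 1 divisor.

```python
from statistics import mean, stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

ybar = mean(heights)
deviations = [y - ybar for y in heights]      # these always sum to 0
ss = sum(d ** 2 for d in deviations)          # sum of squared deviations
df = len(heights) - 1                         # degrees of freedom, n - 1
variance = ss / df                            # s^2, in square inches
s = variance ** 0.5                           # s, in inches

print(round(ybar, 1), round(ss, 1), round(variance, 2), round(s, 2))
print(round(stdev(heights), 2))               # library result for comparison
```

The step-by-step values match the table (mean 63.4, sum of squares 85.2, variance 6.55, standard deviation 2.56), and the library call gives the same standard deviation in one line.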

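Population Standard Deviation

Denoted by the lower case Greek letter \( \sigma \):

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)} \]

\( N \) is the population size (for example, \( N = 34{,}000 \) for NCSU)

\( \mu \) is the population mean

value of \( \sigma \) typically not known

use \( s \) to estimate value of \( \sigma \)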

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that \( s \) and \( \sigma \) are always greater than or equal to zero.

3. The larger the value of \( s \) (or \( \sigma \)), the greater the spread of the data.

When does \( s = 0 \)? When does \( \sigma = 0 \)? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: \( s^2 \) sample variance; \( \sigma^2 \) population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of \( s \) and \( \sigma \)

\( s \) and \( \sigma \) are always greater than or equal to 0 (when does \( s = 0 \)? \( \sigma = 0 \)?)

The larger the value of \( s \) (or \( \sigma \)), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
\( \bar{y} \)   sample mean
\( m \)         sample median
\( s^2 \)       sample variance
\( s \)         sample stand. dev.

POPULATION
\( \mu \)        population mean
\( m \)          population median
\( \sigma^2 \)   population variance
\( \sigma \)     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

(Diagram: the 68-95-99.7 rule links the Mean and Standard Deviation (numerical) with the Histogram (graphical))

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in \( (\bar{y} - s, \ \bar{y} + s) \)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in \( (\bar{y} - 2s, \ \bar{y} + 2s) \)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in \( (\bar{y} - 3s, \ \bar{y} + 3s) \)

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

(Figure: bell-shaped curve; 68% of the area lies between y-s and y+s, 34% on each side of the mean y)

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

(Figure: bell-shaped curve; 95% of the area lies between y-2s and y+2s, 47.5% on each side of the mean y)

Example: textbook costs

\( \bar{y} = 375.48 \), \( s = 42.72 \), \( n = 50 \)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

\( \bar{y} = 375.48 \), \( s = 42.72 \), \( n = 50 \) (the same 50 textbook costs)

1 standard deviation interval about the mean: \( (\bar{y} - s, \ \bar{y} + s) = (332.76, \ 418.20) \)
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: \( (\bar{y} - 2s, \ \bar{y} + 2s) = (290.04, \ 460.92) \)
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: \( (\bar{y} - 3s, \ \bar{y} + 3s) = (247.32, \ 503.64) \)
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
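The three interval counts for the textbook costs can be checked with a short script. This is a sketch; the helper name `count_within` is hypothetical, and the mean and standard deviation are recomputed from the data rather than typed in.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

ybar = mean(costs)
s = stdev(costs)          # sample standard deviation (divides by n - 1)

def count_within(data, center, spread, k):
    """Count values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for y in data if low < y < high)

for k, rule in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    inside = count_within(costs, ybar, s, k)
    print(f"within {k} s.d.: {inside}/{len(costs)} = "
          f"{100 * inside / len(costs):.0f}%  (rule: {rule})")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the rule's 68%, 95%, and 99.7%, as expected for a sample that is only roughly bell-shaped.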

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

\[ z = \frac{y - \bar{y}}{s} \]

where
\( y \) = original data value
\( \bar{y} \) = the sample mean
\( s \) = the sample standard deviation
\( z \) = the z-score corresponding to \( y \)

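Exam 1: \( \bar{y}_1 = 88 \), \( s_1 = 6 \); your exam 1 score: 91

Exam 2: \( \bar{y}_2 = 88 \), \( s_2 = 10 \); your exam 2 score: 92

Which score is better?

\[ z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \]

\[ z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4 \]

91 on exam 1 is better than 92 on exam 2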

If data has mean \( \bar{y} \) and standard deviation \( s \), then standardizing a particular value of \( y \) indicates how many standard deviations \( y \) is above or below the mean \( \bar{y} \).

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100
ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8
Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
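The comparisons above all use the same one-line formula, so a small helper is enough to reproduce them. This is a sketch; `z_score` is a name chosen here for illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math
print(eleanor, gerald)

# The exam comparison: same mean, different spreads.
print(z_score(91, 88, 6), z_score(92, 88, 10))
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly: the larger z-score is the better relative performance.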

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 i   count from nearest end   value
 1            1                0.6
 2            2                1.2
 3            3                1.6
 4            4                1.9
 5            5                1.5
 6            6                2.1
 7            7                2.3   (Q1)
 8            6                2.3
 9            5                2.5
10            4                2.8
11            3                2.9
12            2                3.3
13            1                3.4   (m)
14            2                3.6
15            3                3.7
16            4                3.8
17            5                3.9
18            6                4.1
19            7                4.2   (Q3)
20            6                4.5
21            5                4.7
22            4                4.9
23            3                5.3
24            2                5.6
25            1                6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

   1/4  |  1/4  |  1/4  |  1/4
        Q1      M       Q3

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half, 2 4 6 8 10; Q1 = 6

Q3: median of upper half, 12 14 16 18 20; Q3 = 16
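The quartile rules above can be sketched in Python. Note that software packages use several different quartile conventions; the hypothetical `quartiles` function below follows only the rule stated in these slides (overall median included in both halves when n is odd, in neither half when n is even).

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: overall median in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]       # n even: data split cleanly in two
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))   # n = 10

# The 25 "years until death" values listed earlier in this section (n odd).
disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
print(quartiles(disease_x))
```

Results from a calculator or spreadsheet may differ slightly from this rule, especially for small n, because those tools often interpolate between observations.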

Pulse Rates n = 138

Count  Stem | Leaves
        4 |
  3     4 | 588
  9     5 | 001233444
 10     5 | 5556788899
 23     6 | 00011111122233333344444
 23     6 | 55556666667777788888888
 16     7 | 00000112222334444
 23     7 | 55555666666777888888999
 10     8 | 0000112224
 10     8 | 5555667789
  4     9 | 0012
  2     9 | 58
  4    10 | 0223
       10 |
  1    11 | 1

Median: mean of pulses in locations 69 and 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X: years until death (n = 25)

Largest = max = 6.1
Q3 = third quartile = 4.2
m = median = 3.4
Q1 = first quartile = 2.3
Smallest = min = 0.6

Observations, i = 25 down to 1: 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

(Figure: boxplot of Disease X; vertical axis Years until death, 0 to 7)

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

BOXPLOT

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Disease X: years until death (n = 25), with a larger maximum

Largest = max = 7.9
Q3 = third quartile = 4.2
Q1 = first quartile = 2.3

Observations, i = 25 down to 1: 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Boxplot: display of 5-number summary

BOXPLOT

(Figure: boxplot of Disease X; vertical axis Years until death, 0 to 8)

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
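The outlier check can be sketched as follows. The quartiles are taken from the slides rather than recomputed, the function name `fences` is hypothetical, and the lower fence (Q1 - 1.5 IQR) is included by analogy with the upper one.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Disease X data with the largest value changed to 7.9 (years until death).
years = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
         3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6]

lower, upper = fences(q1=2.3, q3=4.2)
outliers = [y for y in years if y < lower or y > upper]

# Whiskers stop at the most extreme observations that are not outliers.
whisker_top = max(y for y in years if y <= upper)
whisker_bottom = min(y for y in years if y >= lower)

print(round(upper, 2), outliers, whisker_bottom, whisker_top)
```

Only 7.9 lies beyond the upper fence of 7.05, so the upper whisker ends at 6.1 and 7.9 is plotted as a separate point.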

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

(Figure: boxplot with labeled values 40.5, 45, 63, 70, 78, 100.5)

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

(Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369)

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212    202     118      178     710
Dead      673    123     167      528    1491
Total     885    325     285      706    2201

Marginal distributions

marg. dist. of survival
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col.    24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528    1491
        % of col.    76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the same Titanic table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
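The marginal and conditional distributions, and the three fractions in the questions, differ only in which total goes in the denominator. The sketch below makes that explicit for the Titanic counts; all variable names are illustrative.

```python
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead":  [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())        # all people on board
class_totals = [sum(col) for col in zip(*counts.values())]    # column totals

# Marginal distributions: row or column totals divided by the grand total.
survival_marginal = {k: sum(row) / grand_total for k, row in counts.items()}
class_marginal = {c: t / grand_total for c, t in zip(classes, class_totals)}

# Conditional distribution of survival given class: divide by the column total.
survived_given_class = {c: counts["Alive"][j] / class_totals[j]
                        for j, c in enumerate(classes)}

first = classes.index("First")
survivors_in_first = counts["Alive"][first] / sum(counts["Alive"])    # 202/710
first_and_survived = counts["Alive"][first] / grand_total             # 202/2201
first_who_survived = counts["Alive"][first] / class_totals[first]     # 202/325

for c in classes:
    print(f"{c:<7} share of total {class_marginal[c]:.1%}, "
          f"survived {survived_given_class[c]:.1%}")
```

The same count, 202, gives three different answers depending on whether it is divided by the row total, the grand total, or the column total.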

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)

Scatterplot: Blood Alcohol Content vs Number of Beers

(Figure: scatterplot of BAC against number of beers for the same 16 students)

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

\( (x_i, y_i) \): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

(Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7)

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]

bivariate data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)

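Correlation: Fuel Consumption vs Car Weight

(Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7)

r = .9766

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]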
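The formula for r is an average (with divisor n - 1) of products of z-scores, which the following sketch computes for the car weight and fuel consumption data. The function name `correlation` is chosen here for illustration; the standard deviations come from `statistics.stdev`.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of (z-score of x_i) * (z-score of y_i)."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)          # sample standard deviations
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Because each factor is a z-score, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.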

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

(Figure: scatterplot; horizontal axis Fouls, 0 to 30; vertical axis Points, 0 to 80)

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 30: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Example-2Frequency Distribution of Grades

Class Limits Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

Total

2

6

8

7

5

2

30

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hiredstem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams 2012-2013 season(stems are 10rsquos digit)

stem leaf

43

03247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Stem Leaves 4 3 4 588 9 5 001233444 10 5 5556788899 23 6 00011111122233333344444 23 6 55556666667777788888888 16 7 00000112222334444 23 7 55555666666777888888999 10 8 0000112224 10 8 5555667789 4 9 0012 2 9 58 4 10 0223 10 1 11 1

AdvantagesDisadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large) Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays TD passes by NFL teams 1999-2000 2012-13multiply stems by 10

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10rsquos digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data Time plots

plot observations in time order time on horizontal axis variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment quarterly GDP weather records electricity demand etc)

Heat maps word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 32Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability (next section)

measures how ldquospread outrdquo the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9.7

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9.7 = 112.7$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112.7}{7} = 16.1$$
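The same arithmetic can be checked in software. The snippet below is a minimal sketch, not part of the original slides; the variable names are chosen here for illustration only.

```python
# Weekly TV viewing hours for the 7 fourth graders in the example.
tv_hours = [19, 40, 16, 12, 10, 6, 9.7]

n = len(tv_hours)        # sample size
total = sum(tv_hours)    # sum of the observations
y_bar = total / n        # sample mean

print(n, round(total, 1), round(y_bar, 1))
```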

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of the population mean $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram titled "Absences from Work"; horizontal axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, when n is odd

Median = mean of the 2 middle values, when n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example n = 7: 175 28 32 139 141 253 458

Example n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example n = 8: 175 28 32 139 141 253 357 458

Example n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158
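The odd/even rule for the median can be written as a short function. This is a sketch for illustration only (the function name `median` is chosen here, and Python's standard library already provides `statistics.median`, which follows the same rule).

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

m_odd = median([175, 28, 32, 139, 141, 253, 458])        # n = 7
m_even = median([175, 28, 32, 139, 141, 253, 357, 458])  # n = 8
print(m_odd, m_even)
```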

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation, $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
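To see this sensitivity concretely, the sketch below recomputes the mean and median of the class pulse rates from the earlier example with and without the largest value (140). It is an illustration added here, not part of the original slides, and uses Python's standard `statistics` module.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]

# Drop the single unusually large observation (140) and compare.
without_140 = [p for p in pulses if p != 140]

mean_all, median_all = mean(pulses), median(pulses)
mean_trim, median_trim = mean(without_140), median(without_140)

print(round(mean_all, 2), median_all)    # all 23 observations
print(round(mean_trim, 2), median_trim)  # with 140 removed
```

Removing one observation moves the mean by about 2.5 beats per minute but the median by only 0.5.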

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis Year, 1985 to 2013; left vertical axis Mean, Median Salary, $200,000 to $3,700,000; right vertical axis Maximum Salary, $0 to $35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30]

Symmetric data

mean, median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

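Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$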

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations …

Women height (inches)

i | $x_i$ | $\bar{x}$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$
1 | 59 | 63.4 | -4.4 | 19.0
2 | 60 | 63.4 | -3.4 | 11.3
3 | 61 | 63.4 | -2.4 | 5.6
4 | 62 | 63.4 | -1.4 | 1.8
5 | 62 | 63.4 | -1.4 | 1.8
6 | 63 | 63.4 | -0.4 | 0.1
7 | 63 | 63.4 | -0.4 | 0.1
8 | 63 | 63.4 | -0.4 | 0.1
9 | 64 | 63.4 | 0.6 | 0.4
10 | 64 | 63.4 | 0.6 | 0.4
11 | 65 | 63.4 | 1.6 | 2.7
12 | 66 | 63.4 | 2.6 | 7.0
13 | 67 | 63.4 | 3.6 | 13.3
14 | 68 | 63.4 | 4.6 | 21.6
 | Mean 63.4 | | Sum 0.0 | Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n − 1) = 13; (n − 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
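As one software option (the slides name only calculators and Excel; Python is an assumption made here for illustration), the standard `statistics` module computes the sample variance and standard deviation with the n − 1 divisor. The sketch below reproduces the women's heights calculation.

```python
from statistics import mean, stdev, variance

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

x_bar = mean(heights)
deviations = [x - x_bar for x in heights]   # these always sum to 0
s2 = variance(heights)                      # sample variance, divisor n - 1
s = stdev(heights)                          # sample standard deviation

print(round(x_bar, 1), round(s2, 2), round(s, 2))
```

For a whole population, `statistics.pvariance` and `statistics.pstdev` use the divisor N instead.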

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

the value of $\sigma$ is typically not known

use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance, σ² the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0

when does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data

the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

Quantity | SAMPLE | POPULATION
mean | $\bar{y}$ | $\mu$
median | $m$ |
variance | $s^2$ | $\sigma^2$
stand dev | $s$ | $\sigma$

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

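68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 34% of the area on each side of $\bar{y}$, 68% in total between $\bar{y} - s$ and $\bar{y} + s$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 47.5% of the area on each side of $\bar{y}$, 95% in total between $\bar{y} - 2s$ and $\bar{y} + 2s$]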

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
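The three interval counts for the textbook cost data can be reproduced with a short script. This is an illustrative sketch, not part of the original slides; it computes the mean and sample standard deviation from the 50 listed values rather than using the rounded figures.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342, 346, 347, 348,
         348, 349, 354, 355, 355, 360, 361, 364, 367, 369, 371, 373, 377,
         380, 381, 382, 385, 385, 387, 390, 390, 397, 398, 409, 409, 410,
         418, 422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)

# Count the observations within k standard deviations of the mean, k = 1, 2, 3.
counts = []
for k in (1, 2, 3):
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low < c < high for c in costs)
    counts.append(inside)
    print(k, round(low, 2), round(high, 2), inside, f"{100 * inside / len(costs):.0f}%")
```

The observed percentages are close to 68, 95, and 99.7, as expected when the histogram is roughly bell-shaped.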

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40


Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to $y$:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better
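Standardizing is a one-line calculation. The sketch below (an illustration added here; the function name `z_score` is hypothetical) applies it to the exam scores and the SAT/ACT comparison above.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam1 = z_score(91, 88, 6)        # exam 1 score
exam2 = z_score(92, 88, 10)       # exam 2 score
eleanor = z_score(680, 500, 100)  # SAT Math
gerald = z_score(27, 18, 6)       # ACT Math

print(exam1, exam2, eleanor, gerald)
```

Because z-scores are unit-free, scores measured on different scales can be compared directly: the larger z-score is the better relative standing.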

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School | Support | y - ybar | Z-score
Maryland | 15.5 | 6.4 | 1.79
UVA | 13.1 | 4.0 | 1.12
Louisville | 10.9 | 1.8 | 0.50
UNC | 9.2 | 0.1 | 0.03
VaTech | 7.9 | -1.2 | -0.34
FSU | 7.9 | -1.2 | -0.34
GaTech | 7.1 | -2.0 | -0.56
NCSU | 6.5 | -2.6 | -0.73
Clemson | 3.8 | -5.3 | -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

i | count | value
1 | 1 | 0.6
2 | 2 | 1.2
3 | 3 | 1.6
4 | 4 | 1.9
5 | 5 | 1.5
6 | 6 | 2.1
7 | 7 | 2.3
8 | 6 | 2.3
9 | 5 | 2.5
10 | 4 | 2.8
11 | 3 | 2.9
12 | 2 | 3.3
13 | 1 | 3.4
14 | 2 | 3.6
15 | 3 | 3.7
16 | 4 | 3.8
17 | 5 | 3.9
18 | 6 | 4.1
19 | 7 | 4.2
20 | 6 | 4.5
21 | 5 | 4.7
22 | 4 | 4.9
23 | 3 | 5.3
24 | 2 | 5.6
25 | 1 | 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: Q1, M, and Q3 divide the sorted data into 4 pieces, each containing 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
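The quartile rule in this section can be sketched as code. Note that software packages use several different quartile conventions, so built-in functions may give slightly different answers; the sketch below (function names chosen here for illustration) follows the rule stated in these slides, including the overall median in both halves when n is odd.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[: half + 1]   # n odd: overall median goes in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]        # n even: halves do not share a value
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```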

Pulse Rates n = 138

Count | Stem | Leaves
 | 4 |
3 | 4 | 588
9 | 5 | 001233444
10 | 5 | 5556788899
23 | 6 | 00011111122233333344444
23 | 6 | 55556666667777788888888
16 | 7 | 00000112222334444
23 | 7 | 55555666666777888888999
10 | 8 | 0000112224
10 | 8 | 5555667789
4 | 9 | 0012
2 | 9 | 58
4 | 10 | 0223
 | 10 |
1 | 11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth | stem | leaf
2 | 22 | 55
4 | 23 | 57
6 | 24 | 26
7 | 25 | 7
10 | 26 | 257
12 | 27 | 59
(4) | 28 | 1567
15 | 29 | 35599
10 | 30 | 333
7 | 31 | 45
5 | 32 | 155
2 | 33 | 6
1 | 34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data (individuals 25 down to 1): 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis Years until death, 0 to 7]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (individuals 25 down to 1): 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

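Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data marking the median 70, quartiles 63 and 78, the cutoffs 40.5 and 100.5, and the minimum 45]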
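The 1.5 × IQR cutoffs can be computed and applied as follows. This is a sketch added for illustration (the function name `fences` is hypothetical); it takes Q1 and Q3 as given in the slides rather than recomputing them, and reuses the Disease X data with the 7.9-year value.

```python
def fences(q1, q3):
    """Return the (lower, upper) outlier cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Disease X data (years until death), including the 7.9-year individual.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower, upper = fences(2.3, 4.2)                       # Q1 = 2.3, Q3 = 4.2
outliers = [y for y in years if y < lower or y > upper]
whisker_top = max(y for y in years if y <= upper)     # where the upper line ends

pulse_lower, pulse_upper = fences(63, 78)             # class pulse data

print(round(upper, 2), outliers, whisker_top)
print(pulse_lower, pulse_upper)
```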

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

 | Crew | First | Second | Third | Total
Alive | 212 | 202 | 118 | 178 | 710
Dead | 673 | 123 | 167 | 528 | 1491
Total | 885 | 325 | 285 | 706 | 2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions

Given the class of a passenger, what is the chance the passenger survived?

Survival | | Crew | First | Second | Third | Total
Alive | Count | 212 | 202 | 118 | 178 | 710
 | % of col | 24.0 | 62.2 | 41.4 | 25.2 | 32.3
Dead | Count | 673 | 123 | 167 | 528 | 1491
 | % of col | 76.0 | 37.8 | 58.6 | 74.8 | 67.7
Total | Count | 885 | 325 | 285 | 706 | 2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
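The marginal and conditional percentages for the Titanic table can be recomputed from the counts. The sketch below is an illustration added here (the dictionary layout and variable names are choices made for this example, not part of the original slides).

```python
classes = ["Crew", "First", "Second", "Third"]
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

col_totals = {c: alive[c] + dead[c] for c in classes}   # passengers per class
total_alive = sum(alive.values())
grand_total = sum(col_totals.values())

# Marginal distributions (percent of all 2201 people).
marginal_survival = 100 * total_alive / grand_total
marginal_class = {c: 100 * col_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class (percent of each column).
survival_given_class = {c: 100 * alive[c] / col_totals[c] for c in classes}

# The three questions differ only in the denominator.
first_among_survivors = alive["First"] / total_alive         # 202/710
first_and_survived = alive["First"] / grand_total            # 202/2201
survived_given_first = alive["First"] / col_totals["First"]  # 202/325

print(round(marginal_survival, 1), {c: round(p, 1) for c, p in survival_given_class.items()})
```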

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277


Student | Beers | Blood Alcohol
1 | 5 | 0.1
2 | 2 | 0.03
3 | 9 | 0.19
4 | 7 | 0.095
5 | 3 | 0.07
6 | 3 | 0.02
7 | 4 | 0.07
8 | 5 | 0.085
9 | 8 | 0.12
10 | 3 | 0.04
11 | 5 | 0.06
12 | 5 | 0.05
13 | 6 | 0.1
14 | 7 | 0.09
15 | 1 | 0.01
16 | 4 | 0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766
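The formula for r can be applied directly to the fuel consumption data. The sketch below is an illustration added here (the function name `correlation` is chosen for this example); it averages the products of the standardized x and y values using the n − 1 divisor, with the sample standard deviations from Python's `statistics` module.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```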

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3
Page 31: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Example-3 Relative Frequency Distribution of Grades

Class Limits Relative Frequency40 up to 50

50 up to 60

60 up to 70

70 up to 80

80 up to 90

90 up to 100

230 = 067

630 = 200

830 = 267

730 = 233

530 = 167

230 = 067

Relative Frequency Histogram of Grades

005

10

15

20

25

30

40 50 60 70 80 90Grade

Rel

ativ

e fr

eque

ncy

100

Based on the histo-gram about what percent of the values are between 475 and 525

1 50

2 5

3 17

4 30

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39 stem 10rsquos digit leaf 1rsquos digit

18 stem=1 leaf=8 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hiredstem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams 2012-2013 season(stems are 10rsquos digit)

stem leaf

43

03247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Stem Leaves 4 3 4 588 9 5 001233444 10 5 5556788899 23 6 00011111122233333344444 23 6 55556666667777788888888 16 7 00000112222334444 23 7 55555666666777888888999 10 8 0000112224 10 8 5555667789 4 9 0012 2 9 58 4 10 0223 10 1 11 1

AdvantagesDisadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large) Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays TD passes by NFL teams 1999-2000 2012-13multiply stems by 10

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10rsquos digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data Time plots

plot observations in time order time on horizontal axis variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment quarterly GDP weather records electricity demand etc)

Heat maps word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 32Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability (next section)

measures how ldquospread outrdquo the data is

Notation for Data Valuesand Sample Mean

1 2

1 2

3

The sample size is denoted by

For a variable denoted by its observations are denoted by

A common measure of center is the sample mean

The sample mean is denoted by

Shorte

n

n

y y yy

n

y

y y y y

y

n

1 21

1

ned expression for using the symbol

(uppercase Greek letter sigma)n

n

i

i n

i

i

y

y y y

yy

n

y

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

1

population

population mea

Denoted by the Greek letter

is the size (for example =34000 for NCSU)

the value of is typically not known

we often use the sample mean

to estimat

n

e the unknown

N

ii

y

N N

y

N

value of

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Histogram: Absences from Work. Horizontal axis bin labels: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
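The odd/even rule for the median can be written as a short function. This is a sketch in Python; the function name `median` is an illustrative choice, and the two calls reuse the slide's own examples.

```python
def median(values):
    """Median of a data set: the middle value (n odd) or the mean of the 2 middle values (n even)."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # n odd: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: average the 2 middle values


print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```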

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458; m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458; m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location = 12th observation (n = 23); $m = 85$

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985-2014

[Line chart: Baseball Salaries: Mean, Median, and Maximum, 1985-2014. Horizontal axis: year, 1985 to 2013; left vertical axis: mean and median salary, $200,000 to $3,700,000; right vertical axis: maximum salary, $0 to $35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries. Horizontal axis: salary ($1000s); vertical axis: frequency, 0 to 600; bar frequencies from left to right: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores. Horizontal axis: exam scores, 20 to 100; vertical axis: frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Histogram: Bank Customers, 10:00-11:00 am. Horizontal axis: number of customers; vertical axis: frequency, 0 to 20]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest. OK sometimes; in general, too crude; sensitive to one large or small observation.

2. measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

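Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$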

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women's height (inches)

 i    x_i    x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59     63.4       -4.4             19.0
 2    60     63.4       -3.4             11.3
 3    61     63.4       -2.4              5.6
 4    62     63.4       -1.4              1.8
 5    62     63.4       -1.4              1.8
 6    63     63.4       -0.4              0.1
 7    63     63.4       -0.4              0.1
 8    63     63.4       -0.4              0.1
 9    64     63.4        0.6              0.4
10    64     63.4        0.6              0.4
11    65     63.4        1.6              2.7
12    66     63.4        2.6              7.0
13    67     63.4        3.6             13.3
14    68     63.4        4.6             21.6
      Mean 63.4         Sum 0.0          Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure label: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
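As one software route, the women's height calculation (14 heights from 59 to 68 inches) can be reproduced in Python. This is a sketch, not the course's prescribed tool; the variable names are illustrative, and `statistics.stdev` from the standard library is used only as a cross-check because it also divides by n - 1.

```python
import math
import statistics

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n
# Sum of squared deviations from the mean.
squared_devs = sum((x - mean) ** 2 for x in heights)
variance = squared_devs / (n - 1)   # divide by the degrees of freedom, n - 1
s = math.sqrt(variance)             # standard deviation is the square root of the variance

print(round(mean, 1), round(squared_devs, 1), round(variance, 2), round(s, 2))

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(s, statistics.stdev(heights))
```

The rounded results agree with the slide values: mean 63.4, sum of squared deviations 85.2, variance 6.55, standard deviation 2.56.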

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU); $\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance, $\sigma^2$ the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION

$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

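68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s, 34% on each side of the mean]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s, 47.5% on each side of the mean]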

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
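The interval counts for the 50 textbook costs can be checked directly. The sketch below is in Python; the helper name `share_within` is an illustrative assumption, and the mean and standard deviation are recomputed from the data rather than typed in.

```python
import statistics

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = statistics.mean(costs)   # sample mean
s = statistics.stdev(costs)      # sample standard deviation (n - 1 divisor)


def share_within(k):
    """Count and percentage of costs within k standard deviations of the mean."""
    low, high = y_bar - k * s, y_bar + k * s
    count = sum(low <= c <= high for c in costs)
    return count, 100 * count / len(costs)


for k in (1, 2, 3):
    count, pct = share_within(k)
    print(f"within {k} s.d.: {count} of {len(costs)} = {pct:.0f}%")
```

The observed 64%, 96%, and 100% sit close to the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.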

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
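The SAT/ACT comparison comes down to one formula applied twice. A small sketch in Python, assuming a hypothetical helper named `z_score`:

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


eleanor = z_score(680, mean=500, sd=100)  # SAT Math
gerald = z_score(27, mean=18, sd=6)       # ACT Math

print(eleanor, gerald)  # 1.8 1.5
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because z-scores are unit-free, the two tests can be compared even though their raw scales differ.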

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0
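That the z-scores of a data set add to zero follows from the deviations adding to zero, and it can be confirmed numerically with the nine support figures ($ millions: 15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8). A sketch in Python with illustrative variable names; floating-point sums are compared with a small tolerance rather than exact equality.

```python
import statistics

# Support ($ millions) for the 9 public ACC schools, in table order.
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = statistics.mean(support)
s = statistics.stdev(support)

deviations = [y - y_bar for y in support]
z_scores = [d / s for d in deviations]

print(round(y_bar, 4), round(s, 4))
print(abs(sum(deviations)) < 1e-9, abs(sum(z_scores)) < 1e-9)  # True True
```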

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

individual   position   value
     1           1       0.6
     2           2       1.2
     3           3       1.6
     4           4       1.9
     5           5       1.5
     6           6       2.1
     7           7       2.3
     8           6       2.3
     9           5       2.5
    10           4       2.8
    11           3       2.9
    12           2       3.3
    13           1       3.4
    14           2       3.6
    15           3       3.7
    16           4       3.8
    17           5       3.9
    18           6       4.1
    19           7       4.2
    20           6       4.5
    21           5       4.7
    22           4       4.9
    23           3       5.3
    24           2       5.6
    25           1       6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1      M      Q3
 1/4  |  1/4  |  1/4  |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
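The quartile rules above (median of each half, with the overall median included in both halves when n is odd) can be sketched as a short Python function. The function names are illustrative. Note that software packages use several different quartile conventions, so a library routine may return slightly different values than this rule.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        # n even: split cleanly; the median is not a data value in either half.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```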

Pulse Rates, n = 138

count  stem | leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 287
2) 257.5
3) 263.5
4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1; middle quartile: median; upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 23.5
2) 39.5
3) 46
4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X, years until death:

Smallest = min = 0.6; Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2; Largest = max = 6.1

Values for individuals 25 down to 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: Disease X, years until death; vertical axis 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Values for individuals 25 down to 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X, years until death; vertical axis 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
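The 1.5 × IQR outlier check for the Disease X data (Q1 = 2.3, Q3 = 4.2, largest value 7.9) can be sketched as follows. The names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative assumptions; results are rounded when printed because of floating-point arithmetic.

```python
# Quartiles of the Disease X data (years until death), as given on the slide.
q1, q3 = 2.3, 4.2

iqr = q3 - q1                    # interquartile range, about 1.9
lower_fence = q1 - 1.5 * iqr     # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr     # values above this are flagged as outliers


def is_outlier(value):
    """True if value lies more than 1.5 IQRs outside the quartiles."""
    return value < lower_fence or value > upper_fence


print(round(iqr, 2), round(upper_fence, 2))  # 1.9 7.05
print(is_outlier(7.9), is_outlier(6.1))      # True False
```

The whisker then stops at 6.1, the largest data value that is not beyond the upper fence.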

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot labels: minimum 45, Q1 63, median 70, Q3 78; fences at 40.5 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival: Alive 710/2201 = 32.3%; Dead 1491/2201 = 67.7%

marg. dist. of class: Crew 885/2201 = 40.2%; First 325/2201 = 14.8%; Second 285/2201 = 12.9%; Third 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
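The marginal and conditional distributions for the Titanic table, and the three first-class fractions, can be computed from the counts alone. This is a sketch in Python; the dictionary layout and variable names are illustrative assumptions.

```python
# Titanic counts from the contingency table: class -> survival -> count.
counts = {
    "Crew":   {"Alive": 212, "Dead": 673},
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())
alive_total = sum(row["Alive"] for row in counts.values())

# Marginal distribution of class: column total / grand total.
marginal_class = {cls: sum(row.values()) / grand_total for cls, row in counts.items()}

# Conditional distribution of survival given class: alive count / column total.
survival_given_class = {cls: row["Alive"] / sum(row.values()) for cls, row in counts.items()}

# The three questions about first class use three different denominators.
first_among_survivors = counts["First"]["Alive"] / alive_total   # 202/710
first_and_survived = counts["First"]["Alive"] / grand_total      # 202/2201
survived_among_first = survival_given_class["First"]             # 202/325

for cls in counts:
    print(f"{cls:7s} share: {marginal_class[cls]:.1%}   survived: {survival_given_class[cls]:.1%}")
```

The same numerator, 202, gives three different answers because each question conditions on a different group.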

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 80
2) 235
3) 582
4) 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 418
2) 388
3) 512
4) 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 452
2) 488
3) 268
4) 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

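Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$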
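The value r = .9766 can be reproduced from the ten (weight, fuel consumption) pairs by applying the formula term by term: standardize each x and each y, multiply the pairs, sum, and divide by n - 1. A sketch in Python; the function name `correlation` and the list names are illustrative.

```python
import statistics

# (car weight in 1000 lbs, fuel consumption in gal/100 miles) for the 10 cars.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Correlation coefficient r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weights, fuel)
print(round(r, 4))  # 0.9766
```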

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: points (vertical axis, 0 to 80) vs fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Back-to-back stem-and-leaf displays TD passes by NFL teams 1999-2000 2012-13multiply stems by 10

1999-2000 2012-13

2 4 03

6 3 7

2 3 24

6655 2 6677789

43322221100 2 01222233444

9998887666 1 67889

421 1 134

0 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic How many pulses are between 67 and 77

Stems are 10rsquos digits

1 4

2 6

3 8

4 10

5 12

Other Graphical Methods for Data Time plots

plot observations in time order time on horizontal axis variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment quarterly GDP weather records electricity demand etc)

Heat maps word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 32Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability (next section)

measures how ldquospread outrdquo the data is

Notation for Data Valuesand Sample Mean

1 2

1 2

3

The sample size is denoted by

For a variable denoted by its observations are denoted by

A common measure of center is the sample mean

The sample mean is denoted by

Shorte

n

n

y y yy

n

y

y y y y

y

n

1 21

1

ned expression for using the symbol

(uppercase Greek letter sigma)n

n

i

i n

i

i

y

y y y

yy

n

y

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

1

population

population mea

Denoted by the Greek letter

is the size (for example =34000 for NCSU)

the value of is typically not known

we often use the sample mean

to estimat

n

e the unknown

N

ii

y

N N

y

N

value of

Connection Between Mean and Histogram

A histogram balances when supported at the mean Mean x = 1406

Histogram

0

10

20

30

40

50

60

70

118

5

125

5

132

5

139

5

146

5

153

5

16

05

Mo

re

Absences f rom Work

Fre

qu

en

cy

Frequency

The median anothermeasure of center

Given a set of n data values arranged in order of magnitude

Median= middle value n odd

mean of 2 middle values n even

Ex 2 4 6 8 10 n=5 median=6 Ex 2 4 6 8 n=4 median=(4+6)2=5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples

Example, n = 7:
175 28 32 139 141 253 458

Example, n = 7 (ordered):
28 32 139 141 175 253 458

m = 141

Example, n = 8:
175 28 32 139 141 253 357 458

Example, n = 8 (ordered):
28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation; $m = 85$

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
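To see this sensitivity concretely, the sketch below recomputes the mean and median of the class pulse rates from the earlier example with and without the largest value (140). It is an illustration only; the variable names are assumptions, and dropping the value is done purely to show the effect, not as a recommended data-cleaning step.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23, already sorted).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single largest observation (140) left out.
without_max = pulses[:-1]

print(round(mean(pulses), 2), median(pulses))
print(round(mean(without_max), 2), median(without_max))
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.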

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year (1985 to 2013); left vertical axis: Mean, Median Salary ($200,000 to $3,700,000); right vertical axis: Maximum Salary ($0 to $35,000,000). Series: Mean, Median, Maximum.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries. Horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600). Bar labels: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1.]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30).]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]

Section 3.3
Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small observation.

2. measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

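Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$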
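The two data sets above have very different spreads but the same sum of deviations. A short Python sketch (the helper names are illustrative assumptions) makes the point that the raw sum of deviations cannot distinguish them, while the range can:

```python
def sum_of_deviations(values):
    """Sum of (value - mean) over all values; this is 0 for any data set."""
    center = sum(values) / len(values)
    return sum(v - center for v in values)


def data_range(values):
    """Range = largest - smallest."""
    return max(values) - min(values)


data_set_1 = [49, 51]
data_set_2 = [0, 100]

for data in (data_set_1, data_set_2):
    print(data, sum_of_deviations(data), data_range(data))
```

With data that are not this tidy, floating-point rounding can leave a sum that is very close to, rather than exactly, zero.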

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

 i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1   59    63.4      -4.4            19.0
 2   60    63.4      -3.4            11.3
 3   61    63.4      -2.4             5.6
 4   62    63.4      -1.4             1.8
 5   62    63.4      -1.4             1.8
 6   63    63.4      -0.4             0.1
 7   63    63.4      -0.4             0.1
 8   63    63.4      -0.4             0.1
 9   64    63.4       0.6             0.4
10   64    63.4       0.6             0.4
11   65    63.4       1.6             2.7
12   66    63.4       2.6             7.0
13   67    63.4       3.6            13.3
14   68    63.4       4.6            21.6

Mean 63.4        Sum 0.0         Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
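As one example of "other software", the women's heights calculation can be reproduced with Python's standard `statistics` module. This is a sketch for checking the numbers on the slide, not the course's prescribed tool; the variable names are assumptions.

```python
from statistics import mean, stdev, variance

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)

# Step-by-step version of the formula: squared deviations, then divide by n - 1.
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)
s2_by_hand = sum_sq_dev / (len(heights) - 1)

# Library versions (both use the n - 1 divisor).
s2 = variance(heights)
s = stdev(heights)

print(round(x_bar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))
```

`statistics.variance` and `statistics.stdev` divide by n - 1, matching the sample formulas above; the population versions are `pvariance` and `pstdev`.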

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.
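The only computational difference between $\sigma$ and $s$ is the divisor ($N$ versus $n - 1$). The sketch below, with illustrative function names, applies both formulas to the same small list of numbers so the difference is visible; treating the list as a whole population in one call and as a sample in the other is an assumption made purely for the comparison.

```python
from math import sqrt


def population_sd(values):
    """Population standard deviation: divide the squared deviations by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sqrt(sum((y - mu) ** 2 for y in values) / big_n)


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sqrt(sum((y - y_bar) ** 2 for y in values) / (n - 1))


data = [0, 100]
print(population_sd(data), sample_sd(data))
```

For large data sets the two results are nearly identical, since dividing by N or by N - 1 makes little difference.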

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) <-> 68-95-99.7 rule <-> Histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

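68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$.]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$.]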

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
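The three interval counts for the textbook costs can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the mean and standard deviation are taken as given on the slide (375.48 and 42.72) rather than recomputed, and the helper name `count_within` is illustrative.

```python
# Textbook costs for the 50 students in the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR = 375.48  # sample mean, as given on the slide
S = 42.72       # sample standard deviation, as given on the slide


def count_within(values, center, spread, k):
    """Count the values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low < v < high)


for k in (1, 2, 3):
    inside = count_within(costs, Y_BAR, S, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```

Comparing the printed percentages with 68, 95, and 99.7 is a quick check of whether the bell-shaped description is reasonable for a data set.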

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
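Both comparisons (the two exams and SAT versus ACT) use the same one-line formula. A minimal Python sketch, with `z_score` as an illustrative function name:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam comparison: same mean, different standard deviations.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT comparison: different scales altogether.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2, z_eleanor, z_gerald)
```

Because z-scores are unit-free, scores from different scales can be compared directly: the larger z-score is the better relative performance.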

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar), Sum = 0 (Z-score)
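The zero-sum property can be verified numerically for the ACC support figures. The sketch below assumes the support values are exactly as printed in the table (one decimal place); individual z-scores computed from these rounded inputs may differ in the last digit from the table, so only the mean, standard deviation, and sum are checked.

```python
from statistics import mean, stdev

# Support to athletic departments, 2013 ($ millions), from the table above.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)  # sample standard deviation (n - 1 divisor)

# Standardize every school's value.
z_scores = {school: (y - y_bar) / s for school, y in support.items()}

print(round(y_bar, 4), round(s, 4))
print(sum(z_scores.values()))  # zero up to floating-point rounding
```

The sum is zero because the deviations from the mean always sum to zero, and dividing each deviation by the same constant s does not change that.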

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4
Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

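m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Each row: overall rank, position counter within the half, data value.

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1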

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
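The quartile rules above can be sketched in Python as follows. The function names are illustrative assumptions. Note that calculators and software packages use several different quartile conventions, so their answers can differ slightly from this rule.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For the 10-value example this returns Q1 = 6, median = 11, Q3 = 16, matching the hand calculation.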

Pulse Rates n = 138

Count  Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median; upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
  2     22   55
  4     23   57
  6     24   26
  7     25   7
 10     26   257
 12     27   59
 (4)    28   1567
 15     29   35599
 10     30   333
  7     31   45
  5     32   155
  2     33   6
  1     34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Each row: overall rank, position counter within the half, data value.

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X. Vertical axis: Years until death (0 to 7).]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Each row: overall rank, position counter within the half, data value.

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
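The 1.5 × IQR outlier check can be sketched in Python as below. The function names `fences` and `outliers` are illustrative assumptions; the quartiles are passed in as already-computed values (2.3 and 4.2 from the slide), and the data list is the 25 Disease X values as printed in the table above.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, q1, q3):
    """Values falling outside the 1.5*IQR fences."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Years until death, Disease X (the version with largest value 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = fences(2.3, 4.2)
flagged = outliers(years, 2.3, 4.2)

# The upper whisker ends at the largest value that is not an outlier.
whisker_top = max(v for v in years if v <= high)

print(round(high, 2), flagged, whisker_top)
print(fences(63, 78))  # beginning-of-class pulse rates
```

For the pulse-rate quartiles (Q1 = 63, Q3 = 78) the same function gives fences of 40.5 and 100.5.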

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with labels 40.5, 45, 63, 70, 78, 100.5.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: box plot, Pass Catching Yards by Receivers. Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions
Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival             Crew   First  Second  Third  Total
Alive  Count          212     202     118    178    710
       % of col     24.0%   62.2%   41.4%  25.2%  32.3%
Dead   Count          673     123     167    528   1491
       % of col     76.0%   37.8%   58.6%  74.8%  67.7%
Total  Count          885     325     285    706   2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                         Class
Survival             Crew   First  Second  Third  Total
Alive  Count          212     202     118    178    710
       % of col     24.0%   62.2%   41.4%  25.2%  32.3%
Dead   Count          673     123     167    528   1491
       % of col     76.0%   37.8%   58.6%  74.8%  67.7%
Total  Count          885     325     285    706   2201
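The marginal and conditional percentages, and the three fractions in the questions, can all be derived from the four-by-two table of counts. The following Python sketch uses a nested dictionary as an assumed data layout; the variable names are illustrative.

```python
# Titanic counts: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: divide row or column totals by the grand total.
survival_marginal = {s: row_totals[s] / grand_total for s in counts}
class_marginal = {c: col_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions differ only in the denominator.
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
first_and_survived = counts["Alive"]["First"] / grand_total            # 202/2201
survived_given_first = survival_given_class["First"]                   # 202/325

for c in classes:
    print(c, f"{100 * class_marginal[c]:.1f}%", f"{100 * survival_given_class[c]:.1f}%")
```

The numerator (202) is the same in all three questions; identifying the right denominator is the whole task.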

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student  Beers  Blood Alcohol
   1       5      0.10
   2       2      0.03
   3       9      0.19
   4       7      0.095
   5       3      0.07
   6       3      0.02
   7       4      0.07
   8       5      0.085
   9       8      0.12
  10       3      0.04
  11       5      0.06
  12       5      0.05
  13       6      0.10
  14       7      0.09
  15       1      0.01
  16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

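Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$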
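The value r = 0.9766 can be reproduced by coding the formula term by term: standardize each x and each y, multiply the pairs, sum, and divide by n - 1. The sketch below is illustrative; the function names are assumptions, and the data are the ten (weight, fuel consumption) pairs from the scatterplot slide.

```python
from math import sqrt

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (n - 1 divisor)."""
    center = mean(values)
    return sqrt(sum((v - center) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))
```

Because each term is a product of z-scores, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.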

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot. Horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80).]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 3.1 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 3.1 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 3.2 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 3.3 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 3.3 (cont.) Using the Mean and Standard Deviation Toget
  • 68-95-99.7 rule
  • The 68-95-99.7 rule If the histogram of the data is approximat
  • 68-95-99.7 rule 68% within 1 stan. dev. of the mean
  • 68-95-99.7 rule 95% within 2 stan. dev. of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 3.3 (cont.) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 3.4 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 3.5 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 3.5 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Based on the histogram, about what percent of the values are between 475 and 525?

1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays Have the following general appearance

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem: 10's digit; leaf: 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

Suppose a 95 yr old is hired

stem leaf

1 8 9

2 1 2 8 9 9

3 2 3 8 9

4 0 1

5 6 7

6 4

7

8

9 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem leaf

4 03

3 247

2 6677789

2 01222233444

1 13467889

0 8

Pulse Rates n = 138

Count  Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000, 2012-13 (multiply stems by 10)

  1999-2000  stem  2012-13
          2 |  4  | 03
          6 |  3  | 7
          2 |  3  | 24
       6655 |  2  | 6677789
43322221100 |  2  | 01222233444
 9998887666 |  1  | 67889
        421 |  1  | 134
            |  0  | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.).

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2
Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Valuesand Sample Mean

1 2

1 2

3

The sample size is denoted by

For a variable denoted by its observations are denoted by

A common measure of center is the sample mean

The sample mean is denoted by

Shorte

n

n

y y yy

n

y

y y y y

y

n

1 21

1

ned expression for using the symbol

(uppercase Greek letter sigma)n

n

i

i n

i

i

y

y y y

yy

n

y

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

1

population

population mea

Denoted by the Greek letter

is the size (for example =34000 for NCSU)

the value of is typically not known

we often use the sample mean

to estimat

n

e the unknown

N

ii

y

N N

y

N

value of

Connection Between Mean and Histogram

A histogram balances when supported at the mean Mean x = 1406

Histogram

0

10

20

30

40

50

60

70

118

5

125

5

132

5

139

5

146

5

153

5

16

05

Mo

re

Absences f rom Work

Fre

qu

en

cy

Frequency

The median anothermeasure of center

Given a set of n data values arranged in order of magnitude

Median= middle value n odd

mean of 2 middle values n even

Ex 2 4 6 8 10 n=5 median=6 Ex 2 4 6 8 n=4 median=(4+6)2=5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples Example n = 7

175 28 32 139 141 253 458 Example n = 7 (ordered) 28 32 139 141 175 253 458 Example n = 8

175 28 32 139 141 253 357 458

Example n =8 (ordered)

28 32 139 141 175 253 357 458

m = 141

m = (141+175)2 = 158

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496049604971524555467586

1 5245

2 49655

3 4960

4 4971

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496052455546497155877586

1 5245

2 49655

3 5546

4 4971

Properties of Mean Median1The mean and median are unique that is a

data set has only 1 mean and 1 median (the mean and median are not necessarily equal)

2The mean uses the value of every number in the data set the median does not

14

20 4 6Ex 2 4 6 8 5 5

4 2

21 4 6Ex 2 4 6 9 5 5

4 2

x m

x m

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: "Baseball Salaries: Mean, Median and Maximum 1985-2014". Horizontal axis: Year (1985 to 2013). One vertical axis: Mean, Median Salary ($200,000 to $3,700,000); second vertical axis: Maximum Salary ($0 to $35,000,000). Series: Mean, Median, Maximum.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries". Horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600).]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores". Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30).]

Symmetric data

mean, median approximately equal

[Figure: histogram "Bank Customers: 10:00-11:00 am". Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

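Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$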
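The same two data sets in Python, as a rough sketch. The helper name `deviations` is made up for illustration. It shows that the deviations sum to 0 for both sets even though their spreads differ greatly, which is why the deviations are squared before averaging.

```python
import statistics

def deviations(values):
    """Deviation of each value from the mean of the data set."""
    mean = statistics.mean(values)
    return [v - mean for v in values]

set1 = [49, 51]
set2 = [0, 100]

dev1 = deviations(set1)   # -1 and +1
dev2 = deviations(set2)   # -50 and +50

# Squaring the deviations before averaging keeps them from cancelling.
s1 = statistics.stdev(set1)
s2 = statistics.stdev(set2)
print(sum(dev1), sum(dev2), s1, s2)
```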

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women's height (inches)

 i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59    63.4       -4.4             19.0
 2    60    63.4       -3.4             11.3
 3    61    63.4       -2.4              5.6
 4    62    63.4       -1.4              1.8
 5    62    63.4       -1.4              1.8
 6    63    63.4       -0.4              0.1
 7    63    63.4       -0.4              0.1
 8    63    63.4       -0.4              0.1
 9    64    63.4        0.6              0.4
10    64    63.4        0.6              0.4
11    65    63.4        1.6              2.7
12    66    63.4        2.6              7.0
13    67    63.4        3.6             13.3
14    68    63.4        4.6             21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

1. First calculate the variance $s^2$:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
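As one possible software route, here is a minimal Python sketch for the 14 women's heights. It follows the two steps (variance first, then square root) and cross-checks against the standard library's `statistics.stdev`; the variable names are illustrative only.

```python
import math
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1)
sum_sq = sum((x - mean) ** 2 for x in heights)
variance = sum_sq / (n - 1)

# Step 2: standard deviation = square root of the variance
sd = math.sqrt(variance)

# Cross-check with the standard library (also divides by n - 1)
sd_library = statistics.stdev(heights)
print(round(mean, 1), round(sum_sq, 1), round(variance, 2), round(sd, 2))
```

Up to rounding, the results should match the slide: mean about 63.4, sum of squared deviations about 85.2, variance about 6.55, standard deviation about 2.56 inches.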

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION

$\mu$: population mean
$m$: population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

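68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$.]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$.]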

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342 346 347 348 348 349 354 355 355 360 361 364 367 369 371 373 377 380 381 382 385 385 387 390 390 397 398 409 409 410 418 422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean, with $\bar{y} = 375.48$, $s = 42.72$:

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: $\frac{50}{50} = 100\%$; 68-95-99.7 rule: 99.7%
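The three interval counts can be reproduced with a short Python sketch. It takes the mean and standard deviation as reported on the slide (375.48 and 42.72) rather than recomputing them; the function name `share_within` is made up for illustration.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328,
         340, 342, 346, 347, 348, 348, 349, 354,
         355, 355, 360, 361, 364, 367, 369, 371,
         373, 377, 380, 381, 382, 385, 385, 387,
         390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437,
         440, 480]

Y_BAR = 375.48   # sample mean as reported on the slide
S = 42.72        # sample standard deviation as reported on the slide

def share_within(values, center, spread, k):
    """Count and percentage of values strictly inside
    (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low < v < high)
    return inside, 100 * inside / len(values)

results = {k: share_within(costs, Y_BAR, S, k) for k in (1, 2, 3)}
for k, (count, pct) in results.items():
    print(f"within {k} sd: {count} of {len(costs)} = {pct}%")
```

The observed 64%, 96%, and 100% are close to the 68%, 95%, and 99.7% that the rule suggests for bell-shaped data.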

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to $y$:

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
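A minimal Python sketch of the standardization used in the exam and SAT/ACT comparisons. The function name `z_score` is chosen here for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_exam1 = z_score(91, 88, 6)       # exam 1 score
z_exam2 = z_score(92, 88, 10)      # exam 2 score
z_eleanor = z_score(680, 500, 100) # SAT Math
z_gerald = z_score(27, 18, 6)      # ACT Math

print(z_exam1, z_exam2, z_eleanor, z_gerald)
```

In each pair the larger z-score marks the better relative standing, even when the raw score is smaller or measured on a different scale.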

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
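A brief Python sketch that recomputes the z-scores from the support figures in the table and checks that they add to zero (up to floating-point error). The dictionary layout is an illustrative choice; individual z-scores may differ from the slide in the last digit because of rounding.

```python
import statistics

support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
mean = statistics.mean(values)
s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)

z = {school: (v - mean) / s for school, v in support.items()}
z_total = sum(z.values())

print(round(mean, 4), round(s, 4), round(z_total, 10))
```

Because the deviations from the mean always sum to zero, dividing each of them by the same `s` leaves the sum at zero.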

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (n = 25), observations 1 to 25 as listed: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 is the 7th value, the median is the 13th value, and Q3 is the 19th value.

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
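The quartile rule above can be sketched in Python as follows. The function name `quartiles_slide_rule` and the five-value odd-n data set are made up for illustration. Statistical software often uses a different quartile convention, so its answers may differ slightly from this rule.

```python
import statistics

def quartiles_slide_rule(values):
    """Return (Q1, median, Q3) using the rule:
    n odd  -> include the overall median in both halves;
    n even -> split the sorted data into two equal halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # includes the median
        upper = ordered[half:]       # includes the median
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

even_example = quartiles_slide_rule([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
odd_example = quartiles_slide_rule([1, 2, 3, 4, 5])   # hypothetical data
print(even_example, odd_example)
```

For the n = 10 example this gives Q1 = 6, median = 11, Q3 = 16, matching the slide.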

Pulse Rates, n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (n = 25), observations 25 down to 1 as listed: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: "Disease X". Vertical axis: Years until death (0 to 7).]

Five-number summary:

min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (n = 25), observations 25 down to 1 as listed: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: BOXPLOT, "Disease X". Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual #25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with labels 40.5, 45, 63, 70, 78, 100.5.]
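The 1.5 × IQR calculation for the pulse data can be sketched in Python. The function name `outlier_fences` is illustrative; the minimum (45) and maximum (111) come from the 5-number summary of the pulse data.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = outlier_fences(63, 78)

# Check the two extremes from the 5-number summary: 45 63 70 78 111
extremes = [45, 111]
flagged = [v for v in extremes if v < low or v > high]
print(low, high, flagged)
```

The minimum of 45 is inside the fences, while the maximum of 111 lies above 100.5 and would be plotted as an outlier.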

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers". Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival: 710/2201 = 32.3%; 1491/2201 = 67.7%

marg. dist. of class: 885/2201 = 40.2%; 325/2201 = 14.8%; 285/2201 = 12.9%; 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                     Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col.   24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col.   76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
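The marginal and conditional distributions for the Titanic table can be reproduced with a small Python sketch. The nested-dictionary layout and variable names are illustrative choices, not part of the slides.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
class_marginal = {c: class_totals[c] / total for c in classes}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

# The three questions about first class
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())
first_and_survived = counts["Alive"]["First"] / total
survived_among_first = counts["Alive"]["First"] / class_totals["First"]

print({c: round(100 * p, 1) for c, p in survival_given_class.items()})
```

The three questions differ only in the denominator: 710 survivors, 2201 people on board, or 325 in first class.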

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

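Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$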
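A small Python sketch that applies the formula for r to the ten (weight, fuel consumption) pairs from the scatterplot slide. The function name `correlation` is illustrative; each term is the product of the standardized x and y values.

```python
import statistics

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

def correlation(xs, ys):
    """r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

The result should agree with the r = .9766 reported on the slide, up to rounding.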

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Figure: scatterplot. Horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80).]

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Stem and leaf displays

Have the following general appearance:

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit, leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
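The employee-age display can be generated with a few lines of Python. This is a sketch for two-digit values only; the function name `stem_and_leaf` is made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by 10's digit (stem); leaves are the 1's digits, in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)

# Print every stem from the smallest to the largest, including empty ones,
# so that gaps in the data remain visible.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")

# Adding a 95-year-old creates stem 9 and leaves stems 7 and 8 empty.
with_new_hire = stem_and_leaf(ages + [95])
```

Printing the empty stems is a deliberate choice: it keeps the shape of the distribution, and any gap before an unusual value, visible.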

Suppose a 95 yr old is hired

stem | leaf
  1  | 8 9
  2  | 1 2 8 9 9
  3  | 2 3 8 9
  4  | 0 1
  5  | 6 7
  6  | 4
  7  |
  8  |
  9  | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots

plot observations in time order; time on horizontal axis, variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability (next section)

measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$

Population Mean

Denoted by the Greek letter $\mu$

$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$ (population mean)

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: Histogram. Horizontal axis: Absences from Work (bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More); vertical axis: Frequency (0 to 70).]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex. 2, 4, 6, 8, 10: n = 5, median = 6

Ex. 2, 4, 6, 8: n = 4, median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples Example n = 7

175 28 32 139 141 253 458 Example n = 7 (ordered) 28 32 139 141 175 253 458 Example n = 8

175 28 32 139 141 253 357 458

Example n =8 (ordered)

28 32 139 141 175 253 357 458

m = 141

m = (141+175)2 = 158

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496049604971524555467586

1 5245

2 49655

3 4960

4 4971

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496052455546497155877586

1 5245

2 49655

3 5546

4 4971

Properties of Mean Median1The mean and median are unique that is a

data set has only 1 mean and 1 median (the mean and median are not necessarily equal)

2The mean uses the value of every number in the data set the median does not

14

20 4 6Ex 2 4 6 8 5 5

4 2

21 4 6Ex 2 4 6 9 5 5

4 2

x m

x m

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

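Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$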
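The example above can be reproduced in a few lines. This is a minimal sketch in Python; the function name `sum_of_deviations` is a hypothetical helper introduced only for illustration.

```python
from statistics import mean

def sum_of_deviations(values):
    """Return the sum of (value - mean) over all values."""
    center = mean(values)
    return sum(v - center for v in values)

data_set_1 = [49, 51]   # tightly clustered around 50
data_set_2 = [0, 100]   # widely spread around 50

total_1 = sum_of_deviations(data_set_1)
total_2 = sum_of_deviations(data_set_2)
print(total_1, total_2)
```

Both totals are zero even though the second data set is far more spread out, which is why the deviations are squared before they are combined.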

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$: sample standard deviation

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Women's height (inches)

 i    xi    xbar   (xi - xbar)   (xi - xbar)^2
 1    59    63.4     -4.4          19.0
 2    60    63.4     -3.4          11.3
 3    61    63.4     -2.4           5.6
 4    62    63.4     -1.4           1.8
 5    62    63.4     -1.4           1.8
 6    63    63.4     -0.4           0.1
 7    63    63.4     -0.4           0.1
 8    63    63.4     -0.4           0.1
 9    64    63.4      0.6           0.4
10    64    63.4      0.6           0.4
11    65    63.4      1.6           2.7
12    66    63.4      2.6           7.0
13    67    63.4      3.6          13.3
14    68    63.4      4.6          21.6

Mean = 63.4

Sum of deviations from mean = 0.0

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

1. First calculate the variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation: $s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: the height data with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
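As one possible software route (an assumption for illustration; the slides name only a calculator, Excel, or other software), Python's standard `statistics` module reproduces the women's height calculation. `variance` and `stdev` divide by n - 1, matching the sample formulas.

```python
from statistics import mean, variance, stdev

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
s2 = variance(heights)   # sample variance: divides by n - 1 = 13
s = stdev(heights)       # sample standard deviation: square root of s2

print(f"mean = {x_bar:.1f} inches")
print(f"variance = {s2:.2f} square inches")
print(f"standard deviation = {s:.2f} inches")
```

The printed values agree with the hand calculation: mean 63.4, variance 6.55, standard deviation 2.56.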

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N = 34000$ for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$: population standard deviation

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ = sample variance, $\sigma^2$ = population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0 (when does $s = 0$? $\sigma = 0$?)

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) = 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

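68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]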

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: $\frac{50}{50} = 100\%$; 68-95-99.7 rule: 99.7%
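The textbook-cost check of the 68-95-99.7 rule can be automated. The sketch below is one possible way to do it in Python; `share_within` is a hypothetical helper name, and the mean and standard deviation are recomputed from the 50 values rather than typed in.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)  # sample standard deviation (divides by n - 1)

def share_within(values, center, spread, k):
    """Fraction of values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(low < v < high for v in values) / len(values)

# Compare the observed share with the rule for k = 1, 2, 3.
shares = {k: share_within(costs, y_bar, s, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} s.d.: {share:.0%}")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, and 99.7% that the rule suggests for bell-shaped data.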

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = .5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
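The exam and SAT/ACT comparisons above both apply the same formula, which can be wrapped in a small function. This is a minimal sketch in Python; the function name `z_score` and the variable names are illustrative assumptions.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Exam comparison: same mean, different spreads.
exam_1 = z_score(91, 88, 6)
exam_2 = z_score(92, 88, 10)

# SAT vs ACT comparison: different scales made comparable.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(exam_1, exam_2)
print(eleanor, gerald)
```

The larger z-score identifies the better relative performance in each pair: exam 1 over exam 2, and Eleanor over Gerald.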

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
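The claim that z-scores add to zero can be verified numerically. The sketch below, in Python, uses the support figures as printed in the table (rounded to one decimal), so an individual computed z-score may differ in the last digit from the tabled value; the variable names are illustrative.

```python
from statistics import mean, stdev

# Support in $ millions, as printed in the table.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)  # sample standard deviation

# Standardize every school's value.
z = {school: (y - y_bar) / s for school, y in support.items()}
z_total = sum(z.values())

print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of z-scores = {z_total:.10f}")
```

Because the deviations from the mean always sum to zero, dividing each by the same standard deviation leaves a sum of zero, up to floating-point rounding.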

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Obs.   Position counted toward quartile/median   Value
 1      1      0.6
 2      2      1.2
 3      3      1.6
 4      4      1.9
 5      5      1.5
 6      6      2.1
 7      7      2.3   (Q1)
 8      6      2.3
 9      5      2.5
10      4      2.8
11      3      2.9
12      2      3.3
13      1      3.4   (m)
14      2      3.6
15      3      3.7
16      4      3.8
17      5      3.9
18      6      4.1
19      7      4.2   (Q3)
20      6      4.5
21      5      4.7
22      4      4.9
23      3      5.3
24      2      5.6
25      1      6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
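The quartile rules above translate directly into code. The following is a minimal sketch in Python of the rule as stated on this slide (statistical packages often use other quartile conventions and may give slightly different answers); the function name `quartiles` is a hypothetical helper.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the lower-half / upper-half rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the overall median belongs to neither half.
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten-value example this returns Q1 = 6, m = 11, and Q3 = 16, matching the hand calculation.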

Pulse Rates n = 138

Count  Stem | Leaves
        4 |
  3     4 | 588
  9     5 | 001233444
 10     5 | 5556788899
 23     6 | 00011111122233333344444
 23     6 | 55556666667777788888888
 16     7 | 00000112222334444
 23     7 | 55555666666777888888999
 10     8 | 0000112224
 10     8 | 5555667789
  4     9 | 0012
  2     9 | 58
  4    10 | 0223
       10 |
  1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

Observations (25 values): 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X. Vertical axis: Years until death (0 to 7).]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Observations (25 values): 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
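The outlier check above can be sketched in code. This Python example takes Q1 and Q3 as quoted on the slide rather than recomputing them, and the variable names (`upper_fence`, `whisker_top`, and so on) are illustrative assumptions.

```python
# Years until death for the 25 individuals, with 7.9 as the largest value.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2             # quartiles quoted on the slide
iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond either fence are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y < upper_fence)

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print("outliers:", outliers, "whisker top:", whisker_top)
```

Only 7.9 exceeds the upper fence of 7.05, so the whisker ends at 6.1 and 7.9 is drawn as a separate point.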

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with marked values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers. Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
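The marginal distributions above come from the row and column totals of the table. This is a minimal sketch in Python that recomputes them from the cell counts; the dictionary layout and variable names are assumptions made for illustration.

```python
# Titanic counts by survival status (rows) and class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total.
class_marginal = {
    c: sum(row[c] for row in counts.values()) / grand_total for c in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.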

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col.     24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col.     76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                        Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col.     24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col.     76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
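The three questions above share the numerator 202 and differ only in the denominator. The Python sketch below makes that explicit and also recomputes the column percentages; the variable names are illustrative assumptions.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

col_totals = {c: alive[c] + dead[c] for c in alive}
total_alive = sum(alive.values())
grand_total = sum(col_totals.values())

# Conditional distribution of survival given class (the "% of col." rows).
survival_given_class = {c: alive[c] / col_totals[c] for c in alive}

# Same numerator, three different denominators.
frac_survivors_in_first = alive["First"] / total_alive          # 202/710
frac_first_and_survived = alive["First"] / grand_total          # 202/2201
frac_first_who_survived = alive["First"] / col_totals["First"]  # 202/325

for c, share in survival_given_class.items():
    print(f"survived, given {c}: {share:.1%}")
print(f"{frac_survivors_in_first:.3f} {frac_first_and_survived:.3f} "
      f"{frac_first_who_survived:.3f}")
```

Choosing the denominator is the whole question: survivors (710), all passengers and crew (2201), or first class only (325).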

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC vs Beers for the 16 students in the table above]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$

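Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$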
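The correlation formula can be applied term by term to the car weight and fuel consumption data. The following Python sketch does this with the standard library only; the function name `correlation` is a hypothetical helper written for this example.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r is positive when cars that are heavier than average also tend to use more fuel than average.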

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

Page 35: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39

stem = 10's digit; leaf = 1's digit

18: stem = 1, leaf = 8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
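The construction above (stem = 10's digit, leaf = 1's digit) can be sketched in a few lines of Python. The function name `stem_and_leaf` is a hypothetical helper, and the sketch assumes two-digit whole-number data like the employee ages.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (10's digit) to its sorted list of leaves (1's digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data stay visible.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing the empty stems matters when an unusual value is added: with a 95-year-old in the data, stems 7 and 8 appear as blank rows.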

Suppose a 95 yr old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates n = 138

Count  Stem | Leaves
        4 |
  3     4 | 588
  9     5 | 001233444
 10     5 | 5556788899
 23     6 | 00011111122233333344444
 23     6 | 55556666667777788888888
 16     7 | 00000112222334444
 23     7 | 55555666666777888888999
 10     8 | 0000112224
 10     8 | 5555667789
  4     9 | 0012
  2     9 | 58
  4    10 | 0223
       10 |
  1    11 | 1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

$\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \dfrac{\sum_{i=1}^{7} y_i}{7} = \dfrac{112}{7} = 16$

Population Mean

Denoted by the Greek letter $\mu$

$\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$: population mean

$N$ is the population size (for example, $N = 34000$ for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram, Absences from Work. Horizontal axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency (0 to 70).]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point

Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7

175 28 32 139 141 253 458

Example: n = 7 (ordered)

28 32 139 141 175 253 458

m = 141

Example: n = 8

175 28 32 139 141 253 357 458

Example: n = 8 (ordered)

28 32 139 141 175 253 357 458

m = (141+175)/2 = 158
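The odd/even rule for the median is short enough to write out directly. This is a minimal sketch in Python; `median_by_rule` is a hypothetical name, and the data values are entered with the digits exactly as they appear in the examples above.

```python
def median_by_rule(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    data = sorted(values)          # arrange in order of magnitude first
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

m_odd = median_by_rule([175, 28, 32, 139, 141, 253, 458])
m_even = median_by_rule([175, 28, 32, 139, 141, 253, 357, 458])
print(m_odd, m_even)
```

Sorting first is the step most easily forgotten by hand; the unordered and ordered versions of each example give the same median once the values are sorted.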

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x


$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
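As one possible way to automate the calculation, the sketch below reproduces the women's heights example in Python. The choice of Python and the variable names are assumptions for illustration; the slides mention only calculators, Excel, or other software. The step-by-step result is cross-checked against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Square each deviation from the mean, then "average" using n - 1.
squared_deviations = [(x - mean) ** 2 for x in heights]
sum_squared = sum(squared_deviations)
variance = sum_squared / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_squared:.1f}")
print(f"s^2 = {variance:.2f} square inches, s = {std_dev:.2f} inches")

# Cross-check the hand-style calculation against the library routine.
assert math.isclose(std_dev, statistics.stdev(heights))
```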

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.
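The difference between the two divisors can be seen in a short Python sketch. The data values below are hypothetical and chosen only so the arithmetic is easy to follow; `statistics.pstdev` divides by N (population formula) and `statistics.stdev` divides by n - 1 (sample formula).

```python
import statistics

# Hypothetical data, used only to illustrate the two formulas.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = statistics.pstdev(values)  # treats the values as a whole population (divide by N)
s = statistics.stdev(values)       # treats the values as a sample (divide by n - 1)

print(f"population formula: {sigma:.3f}")
print(f"sample formula:     {s:.3f}")
```

For the same numbers the sample formula always gives the larger result unless all values are equal, because it divides the same sum of squares by a smaller number.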

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION
$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
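The three interval counts for the textbook-cost data can be checked with a short Python sketch. The function name `share_within` is an assumption made for illustration; the data are the 50 costs listed in the example.

```python
import statistics

costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

mean = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (divides by n - 1)


def share_within(k):
    """Count and percentage of costs within k standard deviations of the mean."""
    low, high = mean - k * s, mean + k * s
    count = sum(1 for c in costs if low <= c <= high)
    return count, 100 * count / len(costs)


results = {k: share_within(k) for k in (1, 2, 3)}
for k, (count, pct) in results.items():
    print(f"within {k} s.d.: {count} of {len(costs)} values ({pct:.0f}%)")
```

The observed percentages can then be compared with the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.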

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
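The exam and SAT/ACT comparisons both use the same formula, so they can be sketched with one small Python function. The name `z_score` is an assumption made for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam comparison: same mean, different spreads.
exam_1 = z_score(91, 88, 6)
exam_2 = z_score(92, 88, 10)

# SAT vs ACT comparison: different scales entirely.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(f"exam 1 z = {exam_1:.1f}, exam 2 z = {exam_2:.1f}")
print(f"Eleanor z = {eleanor:.1f}, Gerald z = {gerald:.1f}")
```

Because z-scores are unit-free, scores from tests with different means and standard deviations can be compared directly.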

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

i    (count)   value
1    1         0.6
2    2         1.2
3    3         1.6
4    4         1.9
5    5         1.5
6    6         2.1
7    7         2.3
8    6         2.3
9    5         2.5
10   4         2.8
11   3         2.9
12   2         3.3
13   1         3.4
14   2         3.6
15   3         3.7
16   4         3.8
17   5         3.9
18   6         4.1
19   7         4.2
20   6         4.5
21   5         4.7
22   4         4.9
23   3         5.3
24   2         5.6
25   1         6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
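The rules above can be sketched as a short Python function. The name `quartiles` is an assumption made for illustration. Note that this follows the slide's convention (include the overall median in both halves when n is odd); other software may use a different quartile convention and give slightly different answers.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the 'median of each half' rule.

    When n is odd, the overall median is included in both halves;
    when n is even, the data split cleanly into two halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        lower = xs[:half + 1]   # includes the middle value
        upper = xs[half:]       # includes the middle value
    else:
        lower = xs[:half]
        upper = xs[half:]
    return median(lower), median(xs), median(upper)


even_example = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
odd_example = quartiles([1, 2, 3, 4, 5])  # hypothetical odd-n data
print(even_example)
print(odd_example)
```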

Pulse Rates, n = 138

Count   Stem   Leaves
        4
3       4      588
9       5      001233444
10      5      5556788899
23      6      00011111122233333344444
23      6      55556666667777788888888
16      7      00000112222334444
23      7      55555666666777888888999
10      8      0000112224
10      8      5555667789
4       9      0012
2       9      58
4       10     0223
        10
1       11     1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
2       22 | 55
4       23 | 57
6       24 | 26
7       25 | 7
10      26 | 257
12      27 | 59
(4)     28 | 1567
15      29 | 35599
10      30 | 333
7       31 | 45
5       32 | 155
2       33 | 6
1       34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
2       22 | 55
4       23 | 57
6       24 | 26
7       25 | 7
10      26 | 257
12      27 | 59
(4)     28 | 1567
15      29 | 35599
10      30 | 333
7       31 | 45
5       32 | 155
2       33 | 6
1       34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

i    (count)   value
25   1         6.1
24   2         5.6
23   3         5.3
22   4         4.9
21   5         4.7
20   6         4.5
19   7         4.2
18   6         4.1
17   5         3.9
16   4         3.8
15   3         3.7
14   2         3.6
13   1         3.4
12   2         3.3
11   3         2.9
10   4         2.8
9    5         2.5
8    6         2.3
7    7         2.3
6    6         2.1
5    5         1.5
4    4         1.9
3    3         1.6
2    2         1.2
1    1         0.6

[Figure: boxplot for Disease X; vertical axis: years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

i    (count)   value
25   1         7.9
24   2         6.1
23   3         5.3
22   4         4.9
21   5         4.7
20   6         4.5
19   7         4.2
18   6         4.1
17   5         3.9
16   4         3.8
15   3         3.7
14   2         3.6
13   1         3.4
12   2         3.3
11   3         2.9
10   4         2.8
9    5         2.5
8    6         2.3
7    7         2.3
6    6         2.1
5    5         1.5
4    4         1.9
3    3         1.6
2    2         1.2
1    1         0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
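The outlier check for the Disease X data can be sketched in Python. The quartiles are taken as given on the slide (Q1 = 2.3, Q3 = 4.2) rather than recomputed; the variable names, and the use of a lower fence Q1 - 1.5 IQR alongside the upper fence, are assumptions made for illustration.

```python
# Years until death for the 25 individuals, with individual 25 at 7.9 years.
years = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8,
    2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5,
    4.7, 4.9, 5.3, 6.1, 7.9,
]

q1, q3 = 2.3, 4.2               # quartiles as stated on the slide
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond either fence are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print(f"outliers: {outliers}, upper whisker ends at {whisker_top}")
```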

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labels at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition: 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178     710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival             Crew    First   Second   Third   Total
Alive   Count        212     202     118      178     710
        % of col     24.0    62.2    41.4     25.2    32.3
Dead    Count        673     123     167      528     1491
        % of col     76.0    37.8    58.6     74.8    67.7
Total   Count        885     325     285      706     2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

Survival             Crew    First   Second   Third   Total
Alive   Count        212     202     118      178     710
        % of col     24.0    62.2    41.4     25.2    32.3
Dead    Count        673     123     167      528     1491
        % of col     76.0    37.8    58.6     74.8    67.7
Total   Count        885     325     285      706     2201
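The marginal and conditional distributions for the Titanic table can be computed from the raw counts. The Python sketch below is one way to do it; the nested-dictionary layout and variable names are assumptions made for illustration.

```python
# Counts from the Titanic contingency table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}

# Marginal distributions: row totals and column totals over the grand total.
marginal_survival = {status: sum(row.values()) / total for status, row in counts.items()}
marginal_class = {c: class_totals[c] / total for c in classes}

# Conditional distribution of survival given class ("% of col" in the table).
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

# The three questions about first class differ only in the denominator.
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())
first_and_survived = counts["Alive"]["First"] / total
survived_among_first = counts["Alive"]["First"] / class_totals["First"]

print({k: round(100 * v, 1) for k, v in marginal_survival.items()})
print({k: round(100 * v, 1) for k, v in marginal_class.items()})
print({k: round(100 * v, 1) for k, v in survival_given_class.items()})
```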

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
1         5       0.10
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: scatterplot "Blood Alcohol Content vs Number of Beers" for the 16 students]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: weight (1000 lbs), 1.5 to 4.5; y-axis: fuel consumption (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

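Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: weight (1000 lbs), 1.5 to 4.5; y-axis: fuel consumption (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$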
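The formula for r can be applied directly to the ten (weight, fuel consumption) pairs. The Python sketch below is a minimal illustration; the function name `correlation` and the variable names are assumptions, not part of the slides.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Correlation coefficient r: average product of standardized x and y values,
    using n - 1 as the divisor."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r is positive when above-average weights tend to go with above-average fuel consumption.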

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of points (0 to 80) vs fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE
ȳ     sample mean
m     sample median
s²    sample variance
s     sample stand. dev.

POPULATION
μ     population mean
      population median
σ²    population variance
σ     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) → 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
Percentage of data values in this interval: 32/50 = 64%. 68-95-99.7 rule: 68%.

Example: textbook costs (cont.)

2 standard deviation interval about the mean:
$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
Percentage of data values in this interval: 48/50 = 96%. 68-95-99.7 rule: 95%.

Example: textbook costs (cont.)

3 standard deviation interval about the mean:
$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
Percentage of data values in this interval: 50/50 = 100%. 68-95-99.7 rule: 99.7%.
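The three interval counts in the textbook-cost example can be checked with a short script. This is only a sketch: the mean and standard deviation are taken as stated in the example (375.48 and 42.72) rather than recomputed, and the helper name `count_within` is hypothetical.

```python
# Textbook costs (dollars) for the n = 50 values listed in the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

# Summary values as stated in the example (assumed, not recomputed here).
y_bar = 375.48
s = 42.72


def count_within(data, center, spread, k):
    """Count values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for value in data if low < value < high)


for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    inside = count_within(costs, y_bar, s, k)
    pct = 100 * inside / len(costs)
    print(f"within {k} sd: {inside}/{len(costs)} = {pct:.0f}% (rule: about {rule})")
```

The observed percentages are close to, but not exactly, 68/95/99.7, which is what the rule predicts for data that are only approximately bell-shaped.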

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

[Figure: dotplot of men's weights]

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
y = original data value
ȳ = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

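Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.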

If data has mean ȳ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean ȳ.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8
Gerald's z-score: z = (27 - 18)/6 = 1.5
Eleanor's score is better.
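The z-score comparisons above (two exams, and SAT vs ACT) reduce to one small function. A minimal sketch, with `z_score` as an illustrative name:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Two exams with the same mean but different spreads.
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

# SAT Math (Eleanor) versus ACT Math (Gerald).
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

print(f"exam 1: z = {exam1}, exam 2: z = {exam2}")
print(f"Eleanor: z = {eleanor}, Gerald: z = {gerald}")
print("better exam:", "exam 1" if exam1 > exam2 else "exam 2")
print("better score:", "Eleanor" if eleanor > gerald else "Gerald")
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.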

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
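The claim that deviations and z-scores each add to zero can be verified numerically. The sketch below uses the nine support figures from the table; the dictionary layout and names are illustrative. Unrounded z-scores sum to zero up to floating-point error, while the two-decimal values printed in a table may differ in the last digit.

```python
import statistics

# Student/institutional support ($ millions), 9 public ACC schools, 2013.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
mean = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

deviations = {school: y - mean for school, y in support.items()}
z_scores = {school: d / s for school, d in deviations.items()}

for school in support:
    print(f"{school:<11}{support[school]:>6.1f}{deviations[school]:>7.1f}{z_scores[school]:>7.2f}")

print(f"mean = {mean:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(deviations.values()):.10f}")
print(f"sum of z-scores   = {sum(z_scores.values()):.10f}")
```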

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

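m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Disease X data (n = 25). Columns: position i, count from the nearer end of each half, value (years until death).

 i   count   value
 1     1      0.6
 2     2      1.2
 3     3      1.6
 4     4      1.9
 5     5      1.5
 6     6      2.1
 7     7      2.3
 8     6      2.3
 9     5      2.5
10     4      2.8
11     3      2.9
12     2      3.3
13     1      3.4
14     2      3.6
15     3      3.7
16     4      3.8
17     5      3.9
18     6      4.1
19     7      4.2
20     6      4.5
21     5      4.7
22     4      4.9
23     3      5.3
24     2      5.6
25     1      6.1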

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1    M    Q3

1/4    1/4    1/4    1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
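The quartile rules above can be sketched in a few lines of Python. The function name `quartiles` is illustrative. Note that software packages use several different quartile conventions, so built-in functions (for example `statistics.quantiles`) may give slightly different answers from this rule.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the rule from the slides:
    n odd  -> the overall median is included in both halves;
    n even -> the overall median is in neither half."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        lower, upper = xs[:half + 1], xs[half:]
    else:
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}, IQR = {q3 - q1}")
```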

Pulse Rates, n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median; upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

 i   count   value
25     1      6.1
24     2      5.6
23     3      5.3
22     4      4.9
21     5      4.7
20     6      4.5
19     7      4.2
18     6      4.1
17     5      3.9
16     4      3.8
15     3      3.7
14     2      3.6
13     1      3.4
12     2      3.3
11     3      2.9
10     4      2.8
 9     5      2.5
 8     6      2.3
 7     7      2.3
 6     6      2.1
 5     5      1.5
 4     4      1.9
 3     3      1.6
 2     2      1.2
 1     1      0.6

[Figure: boxplot of the Disease X data; vertical axis: years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

 i   count   value
25     1      7.9
24     2      6.1
23     3      5.3
22     4      4.9
21     5      4.7
20     6      4.5
19     7      4.2
18     6      4.1
17     5      3.9
16     4      3.8
15     3      3.7
14     2      3.6
13     1      3.4
12     2      3.3
11     3      2.9
10     4      2.8
 9     5      2.5
 8     6      2.3
 7     7      2.3
 6     6      2.1
 5     5      1.5
 4     4      1.9
 3     3      1.6
 2     2      1.2
 1     1      0.6

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data with an outlier; vertical axis: years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labels 40.5, 45, 63, 70, 78, 100.5]
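The 1.5 × IQR outlier rule used in the two boxplot examples can be written as a pair of small helpers. This is a sketch; the function names `fences` and `is_outlier` are illustrative.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """A value is flagged as an outlier if it falls outside the fences."""
    low, high = fences(q1, q3, k)
    return value < low or value > high


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78, min = 45, max = 111.
low, high = fences(63, 78)
print(f"pulse fences: ({low}, {high})")
print("max pulse 111 is an outlier:", is_outlier(111, 63, 78))
print("min pulse 45 is an outlier:", is_outlier(45, 63, 78))

# Disease X data: Q1 = 2.3, Q3 = 4.2, largest value 7.9 years.
print("7.9 years is an outlier:", is_outlier(7.9, 2.3, 4.2))
```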

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%

Marginal distribution of class: bar chart

Marginal distribution of class: pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212     202     118      178     710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count          673     123     167      528    1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count          885     325     285      706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                         Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212     202     118      178     710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count          673     123     167      528    1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count          885     325     285      706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
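The marginal and conditional distributions for the Titanic table can be computed directly from the counts. A minimal sketch in plain Python; the nested-dictionary layout and variable names are illustrative choices.

```python
# Titanic counts by survival status (rows) and class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals as a percentage of the grand total.
marginal_survival = {s: 100 * t / grand_total for s, t in row_totals.items()}
marginal_class = {c: 100 * t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: percentage of each column.
survival_given_class = {c: 100 * counts["Alive"][c] / col_totals[c] for c in classes}

print("marginal (survival):", {k: round(v, 1) for k, v in marginal_survival.items()})
print("marginal (class):   ", {k: round(v, 1) for k, v in marginal_class.items()})
print("% alive given class:", {k: round(v, 1) for k, v in survival_given_class.items()})

# The three questions about first class use three different denominators.
first_alive = counts["Alive"]["First"]
print("survivors in first class:  ", first_alive / row_totals["Alive"])
print("first class and survivor:  ", first_alive / grand_total)
print("first class who survived:  ", first_alive / col_totals["First"])
```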

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: scatterplot, "Blood Alcohol Content vs Number of Beers", shown beside the table of the 16 students' Beers and BAC values]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

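Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$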
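The formula for r can be applied to the ten (weight, fuel consumption) pairs from the scatterplot. The sketch below follows the formula literally (standardize each variable, multiply, sum, divide by n - 1); the function name `correlation` is illustrative.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r is unit-free: rescaling weight to kilograms or fuel consumption to liters would not change it.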

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Number of TD passes by NFL teams, 2012-2013 season (stems are 10's digit)

stem | leaf
  4  | 03
  3  | 247
  2  | 6677789
  2  | 01222233444
  1  | 13467889
  0  | 8

Pulse Rates, n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages

1) each measurement displayed

2) ascending order in each stem row

3) relatively simple (data set not too large)

Disadvantages

display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

[Figure: stem-and-leaf display of city populations]

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

[Figure: stem-and-leaf display; stems are 10's digits]

1. 4

2. 6

3. 8

4. 10

5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, time on horizontal axis, variable on vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.).

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$

Population Mean

Denoted by the Greek letter μ.

N is the population size (for example, N = 34,000 for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{(population mean)}$$

The value of μ is typically not known; we often use the sample mean ȳ to estimate the unknown value of μ.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x̄ = 140.6

[Figure: histogram, "Absences from Work"; vertical axis: Frequency (0 to 70); bins labeled 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd;
Median = mean of the 2 middle values, if n is even.

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
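The mean and median calculations from the preceding examples can be reproduced with Python's standard `statistics` module. A minimal sketch; the variable names are illustrative and the data are copied from the examples.

```python
import statistics

# Weekly TV viewing time (hours) of 7 fourth graders.
tv_hours = [19, 40, 16, 12, 10, 6, 9]
tv_mean = statistics.mean(tv_hours)
tv_median = statistics.median(tv_hours)

# n odd: the median is the middle value; n even: mean of the 2 middle values.
odd_median = statistics.median([2, 4, 6, 8, 10])
even_median = statistics.median([2, 4, 6, 8])

# Student pulse rates (n = 62).
pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]
pulse_median = statistics.median(pulses)

print(f"TV hours: mean = {tv_mean}, median = {tv_median}")
print(f"odd n median = {odd_median}, even n median = {even_median}")
print(f"pulse rates: n = {len(pulses)}, median = {pulse_median}")
```

`statistics.median` sorts the data internally, so the values do not need to be in order beforehand.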

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7:
175 28 32 139 141 253 458

Example, n = 7 (ordered):
28 32 139 141 175 253 458

m = 141

Example, n = 8:
175 28 32 139 141 253 357 458

Example, n = 8 (ordered):
28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location: 12th obs.; m = 85

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: chart, "Baseball Salaries: Mean, Median and Maximum, 1985-2014"; horizontal axis: Year (1985 to 2013); one vertical axis: Mean, Median Salary ($200,000 to $3,700,000); second vertical axis: Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600)]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram, "Bank Customers 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

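Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$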
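The two data sets above have the same mean and the same (zero) sum of deviations, even though their spreads are very different. A short sketch makes the point numerically and shows that squaring the deviations, as the sample standard deviation does, separates the two sets. The helper name `sum_of_deviations` is illustrative.

```python
import statistics


def sum_of_deviations(data):
    """Sum of (value - mean); this is always 0 up to rounding error."""
    mean = statistics.mean(data)
    return sum(x - mean for x in data)


set1 = [49, 51]
set2 = [0, 100]

for name, data in [("Data set 1", set1), ("Data set 2", set2)]:
    print(
        f"{name}: mean = {statistics.mean(data)}, "
        f"sum of deviations = {sum_of_deviations(data)}, "
        f"sample sd = {statistics.stdev(data):.2f}"
    )
```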

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations …

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

 i    xi    x̄      (xi - x̄)   (xi - x̄)²
 1    59    63.4    -4.4       19.0
 2    60    63.4    -3.4       11.3
 3    61    63.4    -2.4        5.6
 4    62    63.4    -1.4        1.8
 5    62    63.4    -1.4        1.8
 6    63    63.4    -0.4        0.1
 7    63    63.4    -0.4        0.1
 8    63    63.4    -0.4        0.1
 9    64    63.4     0.6        0.4
10    64    63.4     0.6        0.4
11    65    63.4     1.6        2.7
12    66    63.4     2.6        7.0
13    67    63.4     3.6       13.3
14    68    63.4     4.6       21.6

Mean = 63.4

Sum of deviations = 0.0

Sum of squared deviations = 85.2

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example: textbook costs (cont.)

2 standard deviation interval about the mean, for the 50 costs listed above:

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: $48/50 = 96\%$; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean, for the 50 costs listed above:

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: $50/50 = 100\%$; 68-95-99.7 rule: 99.7%
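The three interval checks in the textbook-cost example can be reproduced in software. The sketch below is one possible way to do it in Python, using the standard `statistics` module; the function name `fraction_within` is an illustrative choice and does not come from the slides.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]


def fraction_within(data, k):
    """Fraction of values within k sample standard deviations of the mean."""
    ybar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    lo, hi = ybar - k * s, ybar + k * s
    inside = sum(1 for y in data if lo <= y <= hi)
    return inside / len(data)


for k in (1, 2, 3):
    print(f"within {k} sd: {100 * fraction_within(costs, k):.0f}%")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, 68%, 95%, and 99.7%; the rule is an approximation for bell-shaped data.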

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$, your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$, your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
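Standardizing is a one-line calculation, so comparisons like the exam and SAT/ACT examples are easy to script. A minimal Python sketch follows; the helper name `z_score` is hypothetical, chosen only for this illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: both exams have mean 88, but different spreads.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT example: the scales differ, so compare z-scores instead of raw scores.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print("exam 1:", exam1, "exam 2:", exam2)
print("Eleanor:", eleanor, "Gerald:", gerald)
```

The larger z-score marks the better relative performance: 91 on exam 1 beats 92 on exam 2, and Eleanor's SAT score beats Gerald's ACT score.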

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (25 values, in the order shown on the slide):
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: sorted data split at Q1, M, and Q3 into four pieces, each containing 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
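The quartile rules above translate directly into code. The sketch below, in Python, follows the slide's convention (the overall median is included in both halves when n is odd); the function name `quartiles` is an illustrative choice. Library routines often use other conventions and can return slightly different quartiles for the same data.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slides."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = xs[:half + 1], xs[half:]
    else:
        # n even: the two halves do not overlap.
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```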

Pulse Rates, n = 138

Count  Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

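Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

(First column: cumulative count from the nearer end; the row containing the median shows its own count in parentheses.)

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5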

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

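Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

(First column: cumulative count from the nearer end; the row containing the median shows its own count in parentheses.)

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5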

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

[Figure: the 25 sorted values for Disease X, vertical axis "Years until death" from 0 to 7, with min, Q1, median, Q3, and max marked]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (25 values, ascending; the two largest are now 6.1 and 7.9):
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X, vertical axis "Years until death" from 0 to 8, with the value 7.9 plotted separately above the upper whisker]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]
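The 1.5 × IQR rule used in the two boxplot examples can be sketched in Python as follows. The function names `outlier_fences` and `whisker_ends` are hypothetical helpers written for this illustration, and the quartiles are passed in as given on the slides, not recomputed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def whisker_ends(data, q1, q3):
    """Smallest and largest data values that are not outliers."""
    lo, hi = outlier_fences(q1, q3)
    inside = [y for y in data if lo <= y <= hi]
    return min(inside), max(inside)


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78.
pulse_lo, pulse_hi = outlier_fences(63, 78)
print("pulse fences:", pulse_lo, pulse_hi)

# Disease X data in which the largest value is 7.9 years; Q1 = 2.3, Q3 = 4.2.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
lo, hi = outlier_fences(2.3, 4.2)
outliers = [y for y in years if y < lo or y > hi]
print("outliers:", outliers)
print("whiskers end at:", whisker_ends(years, 2.3, 4.2))
```

Values beyond the fences are drawn as individual points, and each whisker stops at the most extreme value that is still inside its fence.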

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
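Marginal distributions come from the row and column totals divided by the grand total. The following Python sketch recomputes them from the Titanic counts; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Titanic counts: rows are survival status, columns are class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total.
classes = ["Crew", "First", "Second", "Third"]
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / grand_total
    for cls in classes
}

for name, p in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{100 * p:5.1f}%")
```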

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions segmented bar chart
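A conditional distribution divides each count by its own column total instead of the grand total. The Python sketch below reproduces the "% of col" rows of the Titanic table; the function name `conditional_on_class` is a hypothetical helper used only here.

```python
# Titanic counts, organized by class (the conditioning variable).
counts = {
    "Crew":   {"Alive": 212, "Dead": 673},
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
}


def conditional_on_class(table):
    """For each class, the distribution of survival within that class."""
    result = {}
    for cls, column in table.items():
        column_total = sum(column.values())
        result[cls] = {status: n / column_total for status, n in column.items()}
    return result


cond = conditional_on_class(counts)
for cls, dist in cond.items():
    print(cls, {status: round(100 * p, 1) for status, p in dist.items()})
```

Because the survival percentages differ so much from class to class (62.2% in first class versus 24.0% for the crew), survival and class are not independent in this table.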

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of the 16 (beers, BAC) pairs listed in the table above]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5, vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$

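Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5, vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

r = 0.9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$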
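The correlation formula is an average of products of z-scores, which makes it straightforward to compute directly. The Python sketch below applies it to the ten (weight, fuel consumption) pairs from the scatterplot; the function name `correlation` is an illustrative choice, and the calculation relies only on the standard `statistics` module.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Correlation coefficient r: sum of z_x * z_y, divided by n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Swapping the roles of x and y leaves r unchanged, since the formula treats the two variables symmetrically.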

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3



Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47
                        Sum = 0   Sum = 0

Mean = 9.1000, s = 3.5697
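As a rough check of the table, the sketch below standardizes the nine support figures and sums the results. It uses only Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools, as listed in the table
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = mean(support)
s = stdev(support)            # sample standard deviation (divides by n - 1)

z_scores = [(y - y_bar) / s for y in support]

print(round(y_bar, 4), round(s, 4))
print([round(z, 2) for z in z_scores])
print(abs(sum(z_scores)) < 1e-9)   # the z-scores add to zero, up to rounding error
```

Because the deviations from the mean always sum to zero, dividing each of them by the same s cannot change that, so the z-scores must sum to zero as well.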

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4
Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (n = 25), listed by rank as on the slide:

Ranks  1-5:    0.6  1.2  1.6  1.9  1.5
Ranks  6-10:   2.1  2.3  2.3  2.5  2.8
Ranks 11-15:   2.9  3.3  3.4  3.6  3.7
Ranks 16-20:   3.8  3.9  4.1  4.2  4.5
Ranks 21-25:   4.7  4.9  5.3  5.6  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

       Q1      M      Q3
  1/4     1/4     1/4     1/4

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10
Q1 = 6

Q3: median of upper half 12 14 16 18 20
Q3 = 16
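The median-of-halves rule above can be sketched in Python as follows. The function name `quartiles` is an assumption made for illustration; note that library routines such as `statistics.quantiles` interpolate differently and may not match this rule.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the overall median is not a data value, so the halves are disjoint
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the n = 10 example this gives Q1 = 6, m = 11 and Q3 = 16, in agreement with the hand calculation.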

Pulse Rates (n = 138)

Count  Stem  Leaves
         4
    3    4   588
    9    5   001233444
   10    5   5556788899
   23    6   00011111122233333344444
   23    6   55556666667777788888888
   16    7   00000112222334444
   23    7   55555666666777888888999
   10    8   0000112224
   10    8   5555667789
    4    9   0012
    2    9   58
    4   10   0223
        10
    1   11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

        stem  leaf
    2    22   55
    4    23   57
    6    24   26
    7    25   7
   10    26   257
   12    27   59
   (4)   28   1567
   15    29   35599
   10    30   333
    7    31   45
    5    32   155
    2    33   6
    1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR):

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

        stem  leaf
    2    22   55
    4    23   57
    6    24   26
    7    25   7
   10    26   257
   12    27   59
   (4)   28   1567
   15    29   35599
   10    30   333
    7    31   45
    5    32   155
    2    33   6
    1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Five-number summary: min, Q1, m, Q3, max

Smallest = min = 0.6
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Largest = max = 6.1

BOXPLOT: display of 5-number summary

[Figure: the 25 ordered observations (0.6 to 6.1) beside their boxplot; panel title: Disease X; vertical axis: Years until death, 0 to 7]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

BOXPLOT: display of 5-number summary

[Figure: the same 25 observations with the largest value changed from 6.1 to 7.9, beside their boxplot; panel title: Disease X; vertical axis: Years until death, 0 to 8]

Q1 = first quartile = 2.3
Q3 = third quartile = 4.2
Largest = max = 7.9

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is greater than 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with labels at 40.5, 45, 63, 70, 78, and 100.5]
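The 1.5 IQR calculation above can be wrapped in a small helper. This is a sketch only: the name `outlier_fences` is illustrative, and the four pulse values checked are a handful read off the stem-and-leaf display rather than the full data set.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) fences located 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = outlier_fences(63, 78)   # pulse data: Q1 = 63, Q3 = 78
print(lower, upper)

# A few pulse values taken from the stem-and-leaf display
for pulse in (45, 70, 102, 111):
    flagged = pulse < lower or pulse > upper
    print(pulse, "outlier" if flagged else "not an outlier")
```

Values beyond either fence are plotted individually in the boxplot; the whiskers stop at the most extreme data values that are still inside the fences.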

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled Pass Catching Yards by Receivers; horizontal axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew   First   Second   Third   Total
Alive      212     202      118     178     710
Dead       673     123      167     528    1491
Total      885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
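The marginal distributions come from the row and column totals divided by the grand total. A minimal Python sketch of that arithmetic, with illustrative variable names, might look like this:

```python
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead":  [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {status: sum(row) / grand_total for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total
class_marginal = {
    cls: sum(row[j] for row in counts.values()) / grand_total
    for j, cls in enumerate(classes)
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```

Each marginal distribution ignores the other variable entirely, which is why its percentages add to 100% on their own.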

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                                  Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart
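The "% of col" rows are conditional distributions of survival given class: each count is divided by its own column total rather than by the grand total. The sketch below, with illustrative names, reproduces those percentages.

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

# Condition on class: divide each count by its column (class) total
survival_given_class = {}
for cls, a, d in zip(classes, alive, dead):
    total = a + d
    survival_given_class[cls] = {"Alive": a / total, "Dead": d / total}

for cls, dist in survival_given_class.items():
    print(f"{cls}: {dist['Alive']:.1%} alive, {dist['Dead']:.1%} dead")
```

Changing the denominator changes the question being answered: 202 first-class survivors divided by 325 describes first-class passengers, while the same 202 divided by 710 describes survivors.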

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                                  Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
    1       5       0.10
    2       2       0.03
    3       9       0.19
    4       7       0.095
    5       3       0.07
    6       3       0.02
    7       4       0.07
    8       5       0.085
    9       8       0.12
   10       3       0.04
   11       5       0.06
   12       5       0.05
   13       6       0.10
   14       7       0.09
   15       1       0.01
   16       4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

[Figure: scatterplot of BAC against number of beers for the 16 students (Student, Beers, BAC)]

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$

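Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT scatterplot; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$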
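To see the formula for r at work, the sketch below applies it to the ten (weight, fuel consumption) pairs listed on the scatterplot slide. The function name `correlation` is illustrative; only the standard `statistics` module is used.

```python
from statistics import mean, stdev

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, in 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, in gal/100 miles

def correlation(x, y):
    """Sum of products of paired z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

The result should agree with the r = .9766 reported on the slide, and swapping the roles of x and y leaves r unchanged because the formula treats the two variables symmetrically.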

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.)

High correlation does not imply cause and effect.

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player
y = points scored by same player

(x, y) = (fouls, points):
(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) against Fouls (horizontal axis, 0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000          2012-13
           2 | 4 | 03
           6 | 3 | 7
           2 | 3 | 24
        6655 | 2 | 6677789
 43322221100 | 2 | 01222233444
  9998887666 | 1 | 67889
         421 | 1 | 134
             | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2
Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$, so $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \dfrac{\sum_{i=1}^{7} y_i}{7} = \dfrac{112}{7} = 16$
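The same calculation in Python, as a minimal sketch with illustrative variable names:

```python
tv_hours = [19, 40, 16, 12, 10, 6, 9]   # weekly TV viewing time, 7 fourth graders

n = len(tv_hours)        # sample size
total = sum(tv_hours)    # the sum of y_i for i = 1, ..., n
y_bar = total / n        # sample mean

print(n, total, y_bar)
```

The three printed values correspond to n = 7, the sum 112, and the sample mean of 16 hours.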

Population Mean

Denoted by the Greek letter $\mu$:

$\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$ (population mean)

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: frequency histogram of Absences from Work; horizontal axis bins labeled 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex.: 2, 4, 6, 8, 10; n = 5; median = 6
Ex.: 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n = 62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70
70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76
77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87
90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
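The odd/even rule for the median can be sketched as a short Python function. The name `sample_median` is illustrative; Python's built-in `statistics.median` follows the same rule.

```python
def sample_median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(sample_median([2, 4, 6, 8, 10]))   # n odd
print(sample_median([2, 4, 6, 8]))       # n even
print(sample_median(pulses))             # the 62 student pulse rates
```

For the 62 pulse rates the two middle values are the 31st and 32nd ordered observations, 75 and 76, which gives the median of 75.5 shown above.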

The median splits the histogram into 2 halves of equal area

Mean: balance point
Median: 50% area each half

mean: 55.26 years; median: 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011, $166,500; May 2010, $174,600

Median household income (2008 dollars): 2009, $50,221; 2008, $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458

m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458

m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429
4960
4960
4971
5245
5546
7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429
4960
5245
5546
4971
5587
7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex.: 2, 4, 6, 8: $\bar{x} = \dfrac{20}{4} = 5$, $m = \dfrac{4 + 6}{2} = 5$

Ex.: 2, 4, 6, 9: $\bar{x} = \dfrac{21}{4} = 5\tfrac{1}{4}$, $m = \dfrac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location = 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
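A quick way to see this sensitivity is to recompute the mean and median of the class pulse rates with and without the largest value, 140. Dropping that observation is purely an illustration added here, not a step taken in the slides.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89,
          90, 90, 90, 90, 91, 96, 98, 103, 140]
without_largest = pulses[:-1]   # drop the 140 (illustration only)

print(round(mean(pulses), 2), median(pulses))
print(round(mean(without_largest), 2), median(without_largest))
```

Removing a single extreme value shifts the mean by roughly 2.5 beats while the median moves by only 0.5, which is why the median is often preferred for skewed data such as salaries.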

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year (1985 to 2013); left vertical axis: Mean, Median Salary ($200,000 to $3,700,000); right vertical axis: Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram titled 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar frequencies, left to right: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram titled Bank Customers 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3
Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

   deviation of $y_i$ from the mean: $y_i - \bar{y}$

   $\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

   $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

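Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$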

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations:

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ (sample standard deviation)

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women height (inches)

  i    x_i    xbar    (x_i - xbar)    (x_i - xbar)^2
  1    59     63.4       -4.4             19.0
  2    60     63.4       -3.4             11.3
  3    61     63.4       -2.4              5.6
  4    62     63.4       -1.4              1.8
  5    62     63.4       -1.4              1.8
  6    63     63.4       -0.4              0.1
  7    63     63.4       -0.4              0.1
  8    63     63.4       -0.4              0.1
  9    64     63.4        0.6              0.4
 10    64     63.4        0.6              0.4
 11    65     63.4        1.6              2.7
 12    66     63.4        2.6              7.0
 13    67     63.4        3.6             13.3
 14    68     63.4        4.6             21.6
       Mean 63.4        Sum 0.0          Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

1. First calculate the variance $s^2$:

$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
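As one example of letting software do the work, the sketch below repeats the two-step calculation for the 14 heights in Python and compares the result with the standard library's `statistics.stdev`. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]   # inches

n = len(heights)
x_bar = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by n - 1 (degrees of freedom)
squared_devs = [(x - x_bar) ** 2 for x in heights]
variance = sum(squared_devs) / (n - 1)

# Step 2: standard deviation = square root of the variance
s = sqrt(variance)

print(round(x_bar, 1), round(variance, 2), round(s, 2))
print(abs(s - stdev(heights)) < 1e-9)   # agrees with the library routine
```

Rounded as on the slide, this gives a mean of 63.4 inches, a variance of 6.55 square inches, and a standard deviation of 2.56 inches.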

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

the value of $\sigma$ is typically not known

use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ = sample variance; $\sigma^2$ = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) to the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

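68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 34% + 34% = 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 47.5% + 47.5% = 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$]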

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean:

$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: $32/50 = 64\%$

68-95-99.7 rule: 68%
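The comparison between the textbook-cost data and the 68-95-99.7 rule can be automated. The sketch below computes the mean and sample standard deviation from the 50 values and counts how many fall within 1, 2, and 3 standard deviations; the helper name `count_within` is illustrative.

```python
from statistics import mean, stdev

costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)
print(round(y_bar, 2), round(s, 2))

def count_within(data, k):
    """Count the values strictly inside (mean - k*s, mean + k*s)."""
    center, spread = mean(data), stdev(data)
    return sum(center - k * spread < y < center + k * spread for y in data)

for k, rule in ((1, 68), (2, 95), (3, 99.7)):
    inside = count_within(costs, k)
    print(f"within {k} sd: {inside}/{len(costs)} = {inside / len(costs):.0%} (rule: about {rule}%)")
```

The observed percentages are close to, but not exactly, the rule's values, which is what one expects from a sample of 50 whose histogram is only approximately bell-shaped.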

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Count  Stem | Leaves
       4 |
3      4 | 588
9      5 | 001233444
10     5 | 5556788899
23     6 | 00011111122233333344444
23     6 | 55556666667777788888888
16     7 | 0000011222233444
23     7 | 55555666666777888888999
10     8 | 0000112224
10     8 | 5555667789
4      9 | 0012
2      9 | 58
4     10 | 0223
      10 |
1     11 | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data values for individuals 25 down to 1, in the order listed on the slide:
6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 6.1

Smallest = min = 0.6

[Boxplot figure: Disease X; vertical axis: Years until death, 0 to 7]

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data values for individuals 25 down to 1, in the order listed on the slide:
7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 7.9

Boxplot display of 5-number summary

BOXPLOT

[Boxplot figure: Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 x IQR = 1.5 x 1.9 = 2.85

Q3 + 1.5 x IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is greater than 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
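The 1.5 x IQR check above can be sketched as follows. The helper names `outlier_fences` and `find_outliers` are hypothetical, and the data values assume the decimal points lost in the slide text are restored (0.6 through 7.9 years).

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall outside the 1.5*IQR fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Years until death for the 25 individuals (decimal placement is an assumption)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

print(outlier_fences(2.3, 4.2))
print(find_outliers(years, 2.3, 4.2))
```

In a boxplot, the whiskers would stop at the most extreme data values that are still inside the two fences.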

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot figure with labeled values: 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot figure: Pass Catching Yards by Receivers; axis ticks: 0 136 273 410 547 684 821 958 1095 1232 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First  Second  Third  Total
Alive    212    202    118     178    710
Dead     673    123    167     528    1491
Total    885    325    285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival            Crew    First   Second  Third   Total
Alive   Count       212     202     118     178     710
        % of col    24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count       673     123     167     528     1491
        % of col    76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count       885     325     285     706     2201
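As a rough sketch, the marginal and conditional distributions for the Titanic table can be computed from the cell counts alone. The variable names below are invented for this illustration; only the counts come from the slide.

```python
# Cell counts from the Titanic contingency table
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

# Grand total of all cells
total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
marginal_survival = {s: sum(row.values()) / total for s, row in counts.items()}

# Marginal distribution of class: column totals / grand total
col_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
marginal_class = {c: col_totals[c] / total for c in classes}

# Conditional distribution of survival given class: cell / column total
conditional_survival = {
    c: {s: counts[s][c] / col_totals[c] for s in counts} for c in classes
}

for c in classes:
    print(c, f"{conditional_survival[c]['Alive']:.1%} alive")
```

Dividing by the column total (rather than the grand total) is what makes the second set of percentages conditional on class.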

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

Survival            Crew    First   Second  Third   Total
Alive   Count       212     202     118     178     710
        % of col    24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count       673     123     167     528     1491
        % of col    76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count       885     325     285     706     2201

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.10
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.10
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

(Beers, BAC) for students 1 to 16: (5, 0.10) (2, 0.03) (9, 0.19) (7, 0.095) (3, 0.07) (3, 0.02) (4, 0.07) (5, 0.085) (8, 0.12) (3, 0.04) (5, 0.06) (5, 0.05) (6, 0.10) (7, 0.09) (1, 0.01) (4, 0.05)

[Scatterplot figure: Blood Alcohol Content vs Number of Beers]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight. x = car weight, y = fuel consumption.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

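Correlation: Fuel Consumption vs Car Weight

[Scatterplot figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$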
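A minimal sketch of the correlation formula, written term by term as the average product of standardized x and y values (dividing by n - 1). The function name `correlation` is made up here, and the data assume the car weight and fuel consumption pairs read (3.4, 5.5), (3.8, 5.9), and so on, with the decimal points restored; the result should land close to the r = .9766 reported on the slide.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r, computed from standardized values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of z-scores, divided by n - 1
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles); decimals assumed
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))
```

Because both variables are standardized first, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.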

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot figure: horizontal axis: Fouls, 0 to 30; vertical axis: Points, 0 to 80]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3


Population of 185 US cities with between 100000 and 500000

Multiply stems by 100000

Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

1999-2000 | stem | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data Time plots

plot observations in time order time on horizontal axis variable on vertical axis

Time series

measurements are taken at regular intervals (monthly unemployment quarterly GDP weather records electricity demand etc)

Heat maps word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$, so $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$

Population Mean

Denoted by the Greek letter $\mu$:

$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$

$N$ is the population size (for example, $N = 34000$ for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean x = 140.6

[Histogram figure: Absences from Work; vertical axis: Frequency, 0 to 70; bin labels: 118.5 125.5 132.5 139.5 146.5 153.5 160.5 More]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n=5; median=6

Ex: 2, 4, 6, 8; n=4; median=(4+6)/2=5
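The definitions of the sample mean and the median can be sketched directly in Python. The function names `sample_mean` and `sample_median` are chosen for this illustration; the data are the TV viewing hours and the two small median examples from the slides.

```python
def sample_mean(values):
    """Sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

def sample_median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

tv_hours = [19, 40, 16, 12, 10, 6, 9]
print(sample_mean(tv_hours))            # 112 / 7
print(sample_median([2, 4, 6, 8, 10]))  # n odd
print(sample_median([2, 4, 6, 8]))      # n even
```

Note that the median requires sorting the data first, while the mean does not depend on the order of the observations.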

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries: Median $1450000 (max = $32000000, Alex Rodriguez; min = $414000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166500; May 2010 $174600

Median household income (2008 dollars): 2009 $50221; 2008 $52029

Examples

Example: n = 7

175 28 32 139 141 253 458

Example: n = 7 (ordered)

28 32 139 141 175 253 458

m = 141

Example: n = 8

175 28 32 139 141 253 357 458

Example: n = 8 (ordered)

28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location 12th obs., $m = 85$

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Time plot figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year; left vertical axis: Mean, Median Salary (200000 to 3700000); right vertical axis: Maximum Salary (0 to 35000000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram figure: 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar counts: 53 490 102 72 35 21 26 17 8 10 2 3 1 0 0 1]

Skewed to the left (negatively skewed)

Mean < median: mean=78, median=87

[Histogram figure: Histogram of Exam Scores; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Histogram figure: Bank Customers 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

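Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$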

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women height (inches)

i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
1    59    63.4    -4.4            19.0
2    60    63.4    -3.4            11.3
3    61    63.4    -2.4            5.6
4    62    63.4    -1.4            1.8
5    62    63.4    -1.4            1.8
6    63    63.4    -0.4            0.1
7    63    63.4    -0.4            0.1
8    63    63.4    -0.4            0.1
9    64    63.4    0.6             0.4
10   64    63.4    0.6             0.4
11   65    63.4    1.6             2.7
12   66    63.4    2.6             7.0
13   67    63.4    3.6             13.3
14   68    63.4    4.6             21.6
Mean 63.4          Sum 0.0         Sum 85.2

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

1. First calculate the variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

2. Then take the square root to get the standard deviation s.

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
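As one example of letting software do the work, the variance and standard deviation of the 14 heights can be computed with a short Python sketch. The function names below are illustrative only; the calculation divides by n - 1, matching the sample formulas above.

```python
from math import sqrt

# Heights (inches) of the 14 women in the table
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))

print(round(sample_variance(heights), 2))  # square inches
print(round(sample_std(heights), 2))       # inches
```

Python's standard library offers the same calculation as `statistics.variance` and `statistics.stdev`, which also divide by n - 1.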

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

$N$ is the population size (for example, $N = 34000$ for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Bell curve figure: 68% of the area lies between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Bell curve figure: 95% of the area lies between y-2s and y+2s, 47.5% on each side of the mean y]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
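The interval counts in the textbook example can be checked with a short sketch. It uses the summary values reported on the slide (mean 375.48, standard deviation 42.72) rather than recomputing them, and the helper name `share_within` is hypothetical.

```python
# Textbook costs for the 50 students in the example
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR, S = 375.48, 42.72  # summary values as reported on the slide

def share_within(values, center, spread, k):
    """Count and fraction of values within k standard deviations of the mean."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low <= v <= high)
    return inside, inside / len(values)

for k in (1, 2, 3):
    count, share = share_within(costs, Y_BAR, S, k)
    print(f"within {k} sd: {count} of {len(costs)} ({share:.0%})")
```

The observed shares are close to, but not exactly, 68%, 95%, and 99.7%, which is expected since the rule is only an approximation for roughly bell-shaped data.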

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School      Support   y - ybar   Z-score
Maryland    15.5      6.4        1.79
UVA         13.1      4.0        1.12
Louisville  10.9      1.8        0.50
UNC         9.2       0.1        0.03
VaTech      7.9       -1.2       -0.34
FSU         7.9       -1.2       -0.34
GaTech      7.1       -2.0       -0.56
NCSU        6.5       -2.6       -0.73
Clemson     3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0
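A small sketch of the z-score calculation, covering the exam and SAT/ACT comparisons and the fact that z-scores add to zero. The function name `z_score` is made up for this illustration, and the support figures assume the decimal points read 15.5, 13.1, and so on ($ millions).

```python
from math import sqrt

def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Comparing scores measured on different scales
print(z_score(91, 88, 6), z_score(92, 88, 10))      # exam 1 vs exam 2
print(z_score(680, 500, 100), z_score(27, 18, 6))   # SAT vs ACT

# Support to athletic departments, 9 public ACC schools (decimals assumed)
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]
n = len(support)
mean = sum(support) / n
sd = sqrt(sum((v - mean) ** 2 for v in support) / (n - 1))

z_values = [z_score(v, mean, sd) for v in support]
print(round(mean, 4), round(sd, 4))
print(abs(sum(z_values)) < 1e-9)  # z-scores sum to zero, up to rounding error
```

The sum is zero because the deviations from the mean always add to zero, and dividing each one by the same standard deviation does not change that.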

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team What is the

value of the first quartile Q1

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 287

2 2575

3 2635

4 2625

Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3= third quartile = 42

Q1= first quartile = 23

25 1 7924 2 6123 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 79

Boxplot display of 5-number summary

BOXPLOT

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

8

Interquartile range

Q3 ndash Q1=42 minus 23 =

19

Q3+15IQR=42+285 = 705

15 IQR = 1519=285 Individual 25 has a value of

79 years so 79 is an outlier The line from the top

end of the box is drawn to the biggest number in the

data that is less than 705

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses with marks at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit, for example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
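As a rough sketch of how the marginal distributions above are obtained, the Python below divides row totals and column totals by the grand total. The variable names are illustrative only; the counts are the Titanic counts from the table.

```python
# Titanic counts from the table: rows are survival, columns are class.
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead":  [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
marg_survival = {status: 100 * sum(row) / grand_total
                 for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total.
col_totals = [sum(col) for col in zip(*counts.values())]
marg_class = {c: 100 * t / grand_total for c, t in zip(classes, col_totals)}

for name, pct in {**marg_survival, **marg_class}.items():
    print(f"{name}: {pct:.1f}%")
```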

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                           Class
Survival             Crew  First  Second  Third  Total
Alive   Count         212    202     118    178    710
        % of col     24.0   62.2    41.4   25.2   32.3
Dead    Count         673    123     167    528   1491
        % of col     76.0   37.8    58.6   74.8   67.7
Total   Count         885    325     285    706   2201
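The "% of col" rows can be reproduced with a short Python sketch, shown here only as an illustration (the variable names are not from the slides). The key difference from a marginal distribution is the divisor: each count is divided by its own column (class) total, not by the grand total.

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

# Conditional distribution of survival given class:
# percent alive within each class = alive count / class total.
pct_alive_given_class = {
    c: round(100 * a / (a + d), 1) for c, a, d in zip(classes, alive, dead)
}

# The percent dead within each class is the complement.
pct_dead_given_class = {
    c: round(100 * d / (a + d), 1) for c, a, d in zip(classes, alive, dead)
}

print(pct_alive_given_class)
print(pct_dead_given_class)
```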

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic counts above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
   1       5      0.10
   2       2      0.03
   3       9      0.19
   4       7      0.095
   5       3      0.07
   6       3      0.02
   7       4      0.07
   8       5      0.085
   9       8      0.12
  10       3      0.04
  11       5      0.06
  12       5      0.05
  13       6      0.10
  14       7      0.09
  15       1      0.01
  16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

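Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$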
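The formula for r can be checked against the fuel consumption data with a small Python sketch. This is an illustration under assumptions: the helper name `correlation` is hypothetical, and the data pairs assume the decimal points as reconstructed from the slide (weights in 1000 lbs, fuel consumption in gal/100 miles).

```python
from statistics import mean, stdev

# (car weight, fuel consumption) pairs; decimal placement is an assumption.
data = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]

def correlation(pairs):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    n = len(pairs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in pairs) / (n - 1)

r = correlation(data)
print(round(r, 4))
```

With these values the result agrees with the r = .9766 reported on the slide, which supports the reconstructed decimal points.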

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Back-to-back stem-and-leaf displays: TD passes by NFL teams, 1999-2000 and 2012-13 (multiply stems by 10)

  1999-2000 |   | 2012-13
          2 | 4 | 03
          6 | 3 | 7
          2 | 3 | 24
       6655 | 2 | 6677789
43322221100 | 2 | 01222233444
 9998887666 | 1 | 67889
        421 | 1 | 134
            | 0 | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10's digits

1) 4
2) 6
3) 8
4) 10
5) 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$

A common measure of center is the sample mean

The sample mean is denoted by $\bar{y}$

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
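The same arithmetic can be written as a minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Weekly TV viewing time (hours) for the 7 fourth graders in the example.
hours = [19, 40, 16, 12, 10, 6, 9]

n = len(hours)          # sample size
total = sum(hours)      # y_1 + y_2 + ... + y_n
y_bar = total / n       # sample mean

print(n, total, y_bar)
```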

Population Mean

Denoted by the Greek letter $\mu$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram, Absences from Work; y-axis Frequency, 0 to 70; x-axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n = 62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
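The odd/even rule for the median can be sketched as a small Python function. This is an illustration only; the function is written out by hand to mirror the rule on the slide rather than to recommend a particular library.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))   # n = 5, odd
print(median([2, 4, 6, 8]))       # n = 4, even
```

For n = 62, as in the pulse rate data, the function averages the values in ordered positions 31 and 32.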

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area in each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175, 28, 32, 139, 141, 253, 458

Example, n = 7 (ordered): 28, 32, 139, 141, 175, 253, 458; m = 141

Example, n = 8: 175, 28, 32, 139, 141, 253, 357, 458

Example, n = 8 (ordered): 28, 32, 139, 141, 175, 253, 357, 458; m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location: 12th obs.; m = 85

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985-2014

[Figure: line chart, Baseball Salaries: Mean, Median and Maximum, 1985-2014; x-axis Year, 1985 to 2013; left axis Mean, Median Salary, 200,000 to 3,700,000; right axis Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; x-axis Salary ($1000s); y-axis Frequency, 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis Exam Scores, 20 to 100; y-axis Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; x-axis Number of Customers; y-axis Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

  i   x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)²
  1   59    63.4       -4.4            19.0
  2   60    63.4       -3.4            11.3
  3   61    63.4       -2.4             5.6
  4   62    63.4       -1.4             1.8
  5   62    63.4       -1.4             1.8
  6   63    63.4       -0.4             0.1
  7   63    63.4       -0.4             0.1
  8   63    63.4       -0.4             0.1
  9   64    63.4        0.6             0.4
 10   64    63.4        0.6             0.4
 11   65    63.4        1.6             2.7
 12   66    63.4        2.6             7.0
 13   67    63.4        3.6            13.3
 14   68    63.4        4.6            21.6
      Mean 63.4     Sum 0.0         Sum 85.2

1. First calculate the variance s²:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
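As one possible way to get these numbers from software, the Python sketch below computes the variance and standard deviation of the 14 heights twice: once step by step, following the table, and once with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The variable names are illustrative.

```python
from statistics import mean, stdev, variance

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations

df = len(heights) - 1               # degrees of freedom, n - 1
s2_by_hand = sum_sq_dev / df        # sample variance
s_by_hand = s2_by_hand ** 0.5       # sample standard deviation

# Library versions, for comparison with the step-by-step values.
print(round(s2_by_hand, 2), round(variance(heights), 2))
print(round(s_by_hand, 2), round(stdev(heights), 2))
```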

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION
$\mu$        population mean
$m$          population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

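68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve; 68% of the area lies between y-bar - s and y-bar + s, 34% on each side of y-bar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve; 95% of the area lies between y-bar - 2s and y-bar + 2s, 47.5% on each side of y-bar]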

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
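The interval counts for the textbook cost data can be reproduced with a short Python sketch. It is an illustration only: the helper name `count_within` is hypothetical, and the mean and standard deviation are recomputed from the 50 listed values rather than typed in.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)     # sample standard deviation (divides by n - 1)

def count_within(k):
    """Number of values inside (y_bar - k*s, y_bar + k*s)."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(lo < c < hi for c in costs)

for k in (1, 2, 3):
    inside = count_within(k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```

The observed percentages are then compared with the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.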

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
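The z-score comparisons above can be sketched in Python as follows. The function name `z_score` and the variable names are illustrative and not from the slides.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - y_bar) / s

# Exam example: same mean, different spread.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales, compared in standard deviation units.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)
print(z_eleanor, z_gerald)
```

The larger z-score marks the better relative standing, which is why 91 on exam 1 beats 92 on exam 2 and Eleanor's SAT score beats Gerald's ACT score.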

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School      Support  y - ybar  Z-score
Maryland      15.5      6.4      1.79
UVA           13.1      4.0      1.12
Louisville    10.9      1.8      0.50
UNC            9.2      0.1      0.03
VaTech         7.9     -1.2     -0.34
FSU            7.9     -1.2     -0.34
GaTech         7.1     -2.0     -0.56
NCSU           6.5     -2.6     -0.73
Clemson        3.8     -5.3     -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
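The quartile rules above can be sketched in Python as shown below. This is an illustration of this particular rule only; the function names are hypothetical, and other software (including Python's own `statistics.quantiles` with its default settings) may use a different convention and give slightly different quartiles.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated above:
    n odd  -> the overall median is included in both halves;
    n even -> the data split into two equal halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                    # size of each half under this rule
    lower = ordered[:half]                 # smallest `half` values
    upper = ordered[n - half:]             # largest `half` values
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For odd n, `half` is (n + 1)/2, so the middle value appears in both halves, which matches the "include the overall median in both halves" instruction.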

Pulse Rates, n = 138

      Stem | Leaves
        4 |
  3     4 | 588
  9     5 | 001233444
 10     5 | 5556788899
 23     6 | 00011111122233333344444
 23     6 | 55556666667777788888888
 16     7 | 00000112222334444
 23     7 | 55555666666777888888999
 10     8 | 0000112224
 10     8 | 5555667789
  4     9 | 0012
  2     9 | 58
  4    10 | 0223
       10 |
  1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
  2    22 | 55
  4    23 | 57
  6    24 | 26
  7    25 | 7
 10    26 | 257
 12    27 | 59
 (4)   28 | 1567
 15    29 | 35599
 10    30 | 333
  7    31 | 45
  5    32 | 155
  2    33 | 6
  1    34 | 0

1) 287
2) 257.5
3) 263.5
4) 262.5


1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? (Stems are 10's digits.)

1. 4
2. 6
3. 8
4. 10
5. 12

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.).

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9.

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
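The same arithmetic can be checked in a few lines of Python. This is only an illustrative sketch; the function name `sample_mean` is an assumption and does not come from the slides.

```python
# Weekly TV viewing time (hours) of the 7 fourth graders from the example.
hours = [19, 40, 16, 12, 10, 6, 9]

def sample_mean(values):
    """Sample mean: the sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

total = sum(hours)          # numerator of the formula
y_bar = sample_mean(hours)  # total divided by n = 7
print(total, y_bar)
```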

Population Mean

Denoted by the Greek letter $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of the population mean $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram; horizontal axis: Absences from Work; vertical axis: Frequency]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
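The odd/even rule for the median can be sketched directly in Python. The helper name `median_by_rule` is hypothetical; it simply follows the rule stated on the slide and is applied to the 62 student pulse rates.

```python
def median_by_rule(values):
    """Median: middle value if n is odd, mean of the 2 middle values if n is even."""
    xs = sorted(values)          # arrange in order of magnitude
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]           # n odd: the single middle value
    return (xs[mid - 1] + xs[mid]) / 2  # n even: mean of the 2 middle values

pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

pulse_median = median_by_rule(pulses)
print(len(pulses), pulse_median)
```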

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean = 55.26 years, median = 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458
Example, n = 7 (ordered): 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458
Example, n = 8 (ordered): 28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$n = 23, \qquad \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th observation; m = 85
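A short Python sketch shows how differently the mean and the median react to the single 140 reading in this data set. Dropping that value is an illustrative assumption made here, not a step taken on the slides.

```python
from statistics import mean, median

# Class pulse rates from the example (n = 23); 140 is the unusually large value.
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

mean_all = mean(pulses)
median_all = median(pulses)

# Hypothetical comparison: the same data without the largest observation.
trimmed = pulses[:-1]
mean_trimmed = mean(trimmed)
median_trimmed = median(trimmed)

print(round(mean_all, 2), median_all)
print(round(mean_trimmed, 2), median_trimmed)
```

In this sketch the mean moves by about 2.5 beats while the median moves by only 0.5, which is the sensitivity to a few extreme observations that the following slides describe.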

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis: Year; left vertical axis: Mean, Median Salary; right vertical axis: Maximum Salary]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

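Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$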
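The two data sets above can be run through a small Python sketch to confirm that the deviations cancel in both cases even though the spreads are very different. The function name `sum_of_deviations` is an assumption introduced for illustration.

```python
def sum_of_deviations(values):
    """Sum of (value - mean) over all values; this is 0 for any data set."""
    center = sum(values) / len(values)
    return sum(v - center for v in values)

data_set_1 = [49, 51]
data_set_2 = [0, 100]

dev_1 = sum_of_deviations(data_set_1)
dev_2 = sum_of_deviations(data_set_2)

# The range does distinguish the two data sets, although it is a crude measure.
range_1 = max(data_set_1) - min(data_set_1)
range_2 = max(data_set_2) - min(data_set_2)
print(dev_1, dev_2, range_1, range_2)
```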

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

Women's height (inches)

 i   x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1   59    63.4    -4.4            19.0
 2   60    63.4    -3.4            11.3
 3   61    63.4    -2.4             5.6
 4   62    63.4    -1.4             1.8
 5   62    63.4    -1.4             1.8
 6   63    63.4    -0.4             0.1
 7   63    63.4    -0.4             0.1
 8   63    63.4    -0.4             0.1
 9   64    63.4     0.6             0.4
10   64    63.4     0.6             0.4
11   65    63.4     1.6             2.7
12   66    63.4     2.6             7.0
13   67    63.4     3.6            13.3
14   68    63.4     4.6            21.6
     Mean 63.4     Sum 0.0         Sum 85.2

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: histogram of the heights with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
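As one example of "other software", the height calculation can be reproduced with plain Python. This is a minimal sketch that follows the two steps above (variance first, then square root); the variable names are illustrative assumptions.

```python
from math import sqrt

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)

# Step 1: the sample variance divides by the degrees of freedom, n - 1.
variance = sum_sq_dev / (n - 1)

# Step 2: the sample standard deviation is the square root of the variance.
s = sqrt(variance)

print(round(x_bar, 1), round(sum_sq_dev, 1), round(variance, 2), round(s, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.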

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU); $\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION
$\mu$   population mean
$m$   population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s, 34% on each side of y-bar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s, 47.5% on each side of y-bar]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: $\frac{50}{50} = 100\%$; 68-95-99.7 rule: 99.7%
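The three interval counts for the textbook costs can be checked with a short Python sketch. The helper name `count_within` is an assumption introduced here; the data are the 50 textbook costs from the example.

```python
from statistics import mean, stdev

costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)  # sample standard deviation (n - 1 divisor)

def count_within(k):
    """Number of data values inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(1 for c in costs if low < c < high)

for k in (1, 2, 3):
    inside = count_within(k)
    print(k, inside, 100 * inside / len(costs))
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, and 99.7% that the rule gives for bell-shaped data.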

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
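Both comparisons (the two exams and the SAT/ACT scores) use the same standardization, sketched below in Python. The function name `z_score` is an illustrative assumption.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# SAT vs ACT comparison from the slide.
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

# Exam 1 vs exam 2 comparison from the earlier slide.
exam_1 = z_score(91, mean=88, sd=6)
exam_2 = z_score(92, mean=88, sd=10)

print(eleanor, gerald, exam_1, exam_2)
```

The larger z-score in each pair marks the score that stands higher relative to its own distribution.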

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4        1.79
UVA           13.1       4.0        1.12
Louisville    10.9       1.8        0.50
UNC            9.2       0.1        0.03
VaTech         7.9      -1.2       -0.34
FSU            7.9      -1.2       -0.34
GaTech         7.1      -2.0       -0.56
NCSU           6.5      -2.6       -0.73
Clemson        3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
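The steps above can be sketched in Python as follows. The function name `quartiles_by_rule` is hypothetical. It implements this slide's convention (overall median included in both halves when n is odd); statistical packages often use other quartile conventions and may return slightly different values.

```python
from statistics import median

def quartiles_by_rule(values):
    """Return (Q1, median, Q3) using the halves rule stated on the slide."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = xs[:half + 1], xs[half:]
    else:
        # n even: the data split cleanly; no value is shared.
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)

even_example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(quartiles_by_rule(even_example))

# The 25 values from the Section 3.4 opening example (n odd).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
print(quartiles_by_rule(years))
```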

Pulse Rates n = 138

      Stem  Leaves
       4
 3     4    588
 9     5    001233444
10     5    5556788899
23     6    00011111122233333344444
23     6    55556666667777788888888
16     7    00000112222334444
23     7    55555666666777888888999
10     8    0000112224
10     8    5555667789
 4     9    0012
 2     9    58
 4    10    0223
      10
 1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

[Figure: BOXPLOT]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 7.9

Boxplot display of 5-number summary

[Figure: BOXPLOT for Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
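The 1.5 IQR check can be sketched in Python using the 25 values from the table. The quartiles are taken as stated on the slide rather than recomputed, and the lower fence (Q1 - 1.5 IQR) is included as an assumption by symmetry with the upper fence that the slide computes.

```python
# Years until death for the 25 individuals in the table (largest value 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2                  # quartiles as given on the slide
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr       # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr       # assumed symmetric rule for the low end

outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```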

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, marking 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212    202     118      178     710
Dead      673    123     167      528    1491
Total     885    325     285      706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212    202     118      178     710
        % of col      24.0   62.2    41.4     25.2    32.3
Dead    Count          673    123     167      528    1491
        % of col      76.0   37.8    58.6     74.8    67.7
Total   Count          885    325     285      706    2201
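The marginal and conditional percentages in the Titanic tables can be reproduced with a short Python sketch. The dictionary layout and variable names are illustrative assumptions; the counts are those in the table.

```python
# Counts from the Titanic contingency table: survival (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}

# Marginal distributions: row or column totals divided by the grand total.
survival_marginal = {s: round(100 * sum(row.values()) / grand_total, 1)
                     for s, row in counts.items()}
class_marginal = {c: round(100 * class_totals[c] / grand_total, 1)
                  for c in classes}

# Conditional distribution of survival given class: divide by the column total.
alive_given_class = {c: round(100 * counts["Alive"][c] / class_totals[c], 1)
                     for c in classes}

print(survival_marginal)
print(class_marginal)
print(alive_given_class)
```

The only difference between the marginal and conditional calculations is the denominator: the grand total in one case, the total for the given class in the other.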

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                       Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212    202     118      178     710
        % of col      24.0   62.2    41.4     25.2    32.3
Dead    Count          673    123     167      528    1491
        % of col      76.0   37.8    58.6     74.8    67.7
Total   Count          885    325     285      706    2201

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
  1         5       0.1
  2         2       0.03
  3         9       0.19
  4         7       0.095
  5         3       0.07
  6         3       0.02
  7         4       0.07
  8         5       0.085
  9         8       0.12
 10         3       0.04
 11         5       0.06
 12         5       0.05
 13         6       0.1
 14         7       0.09
 15         1       0.01
 16         4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

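Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$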
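The formula for r can be applied to the ten (weight, fuel consumption) pairs in a short Python sketch. The decimal points in the data below are an assumption: they were lost in the extracted text and are restored here as weights in 1000 lbs and consumption in gal/100 miles, a reading that reproduces the stated r = .9766. The function name `correlation` is also illustrative.

```python
from statistics import mean, stdev

# (x, y) = (car weight in 1000 lbs, fuel consumption in gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so r is unchanged if either variable is rescaled (for example, pounds instead of 1000 lbs).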

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>

Other Graphical Methods for Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Heat maps, word walls

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2: Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean. The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112, \qquad \bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
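The same arithmetic can be checked in software. The following is a minimal Python sketch; the names `hours` and `sample_mean` are illustrative and do not come from the slides.

```python
# Weekly TV viewing times (hours) from the example above.
hours = [19, 40, 16, 12, 10, 6, 9]


def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


total = sum(hours)          # the sum of the 7 observations
y_bar = sample_mean(hours)  # the sample mean
print(total, y_bar)
```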

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram of absences from work; vertical axis: frequency]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
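The odd/even rule for the median can be written out directly. This is a small Python sketch under that rule; the function name `median_by_rule` is an illustrative choice.

```python
def median_by_rule(values):
    """Middle value when n is odd; mean of the 2 middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Student pulse rates (n = 62) from the slide above.
pulse_rates = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(median_by_rule([2, 4, 6, 8, 10]))  # n odd
print(median_by_rule([2, 4, 6, 8]))      # n even
print(median_by_rule(pulse_rates))       # mean of the 31st and 32nd values
```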

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$n = 23, \qquad \bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location: 12th obs., $m = 85$
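The class pulse rates also show how differently the mean and the median react to one extreme value. The sketch below uses Python's standard `statistics` module; dropping the largest pulse (140) is an assumption made only for illustration and is not a step from the slides.

```python
from statistics import mean, median

# Class pulse rates (n = 23) from the example above.
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

full_mean = mean(pulses)
full_median = median(pulses)

# Illustrative assumption: remove the largest value (140) and recompute.
trimmed = pulses[:-1]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print(round(full_mean, 2), full_median)
print(round(trimmed_mean, 2), trimmed_median)
```

In this data set the mean shifts by about 2.5 beats when the single value 140 is removed, while the median shifts by only half a beat.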

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985-2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; horizontal axis: year; one vertical axis: mean and median salary; other vertical axis: maximum salary]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis: salary ($1000s); vertical axis: frequency]

Skewed to the left (negatively skewed)

mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: exam scores, 20 to 100; vertical axis: frequency]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis: number of customers; vertical axis: frequency]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general too crude; sensitive to one large or small obs.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$ = deviation of $y_i$ from the mean $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

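Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$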
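The two data sets can be run through a short Python sketch to show why the plain sum of deviations says nothing about spread, while the sum of squared deviations does. The helper name `deviations` is illustrative.

```python
def deviations(values):
    """Deviation of each observation from the sample mean."""
    center = sum(values) / len(values)
    return [v - center for v in values]


data_set_1 = [49, 51]
data_set_2 = [0, 100]

for data in (data_set_1, data_set_2):
    devs = deviations(data)
    sum_of_devs = sum(devs)                    # 0 for both data sets
    sum_of_squares = sum(d ** 2 for d in devs)  # differs greatly between them
    print(data, devs, sum_of_devs, sum_of_squares)
```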

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

  i    x_i    xbar    (x_i - xbar)    (x_i - xbar)^2
  1    59     63.4       -4.4             19.0
  2    60     63.4       -3.4             11.3
  3    61     63.4       -2.4              5.6
  4    62     63.4       -1.4              1.8
  5    62     63.4       -1.4              1.8
  6    63     63.4       -0.4              0.1
  7    63     63.4       -0.4              0.1
  8    63     63.4       -0.4              0.1
  9    64     63.4        0.6              0.4
 10    64     63.4        0.6              0.4
 11    65     63.4        1.6              2.7
 12    66     63.4        2.6              7.0
 13    67     63.4        3.6             13.3
 14    68     63.4        4.6             21.6

Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
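As one example of "other software", the height calculation can be reproduced in Python. This is a sketch, not the course's prescribed tool: it follows the two steps (variance first, then square root) and cross-checks against `statistics.stdev`, which also divides by n - 1.

```python
import math
from statistics import stdev

# Women's heights (inches) from the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1).
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)
variance = sum_sq_dev / (n - 1)

# Step 2: standard deviation = square root of the variance.
s = math.sqrt(variance)

print(round(x_bar, 1), round(sum_sq_dev, 1), round(variance, 2), round(s, 2))
print(math.isclose(s, stdev(heights)))  # library result agrees with the hand formula
```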

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU); $\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
$m$: population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

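68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]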

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
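The three interval counts for the textbook costs can be reproduced with a few lines of Python. This is a sketch; the helper `count_within` is an illustrative name, and the standard library's `stdev` (which divides by n - 1) is assumed to match the slide's $s$.

```python
from statistics import mean, stdev

# Textbook costs (n = 50) from the example above.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)


def count_within(values, center, spread, k):
    """Number of values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low < v < high)


counts = [count_within(costs, y_bar, s, k) for k in (1, 2, 3)]
for k, c in zip((1, 2, 3), counts):
    print(k, round(y_bar - k * s, 2), round(y_bar + k * s, 2), c, f"{c / len(costs):.0%}")
```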

The best estimate of the standard deviation of the men's weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5, \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
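Standardizing is a one-line formula, so the exam and SAT/ACT comparisons are easy to verify. A minimal Python sketch follows; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam comparison: same mean, different standard deviations.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT comparison: different scales made comparable.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(exam1, exam2)     # the larger z-score is the better relative result
print(eleanor, gerald)
```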

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0
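The claim that z-scores add to zero can be checked on the support figures. This Python sketch recomputes the mean, the standard deviation, and the z-scores from the listed support values; small differences from the printed two-decimal z-scores may come from rounding.

```python
from statistics import mean, stdev

# Support in $ millions, as listed in the table above.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)

# z-score for each school: (y - ybar) / s
z = {school: (y - y_bar) / s for school, y in support.items()}
z_total = sum(z.values())

print(round(y_bar, 4), round(s, 4))
print({school: round(score, 2) for school, score in z.items()})
print(abs(z_total) < 1e-9)  # zero up to floating-point rounding
```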

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Obs.   Count   Value
  1      1      0.6
  2      2      1.2
  3      3      1.6
  4      4      1.9
  5      5      1.5
  6      6      2.1
  7      7      2.3   (Q1)
  8      6      2.3
  9      5      2.5
 10      4      2.8
 11      3      2.9
 12      2      3.3
 13      1      3.4   (m)
 14      2      3.6
 15      3      3.7
 16      4      3.8
 17      5      3.9
 18      6      4.1
 19      7      4.2   (Q3)
 20      6      4.5
 21      5      4.7
 22      4      4.9
 23      3      5.3
 24      2      5.6
 25      1      6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
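The quartile rules above translate into a short function. The Python sketch below implements this particular convention (overall median included in both halves when n is odd); calculators and statistical packages often use other conventions, so their quartiles may differ slightly. The function names are illustrative.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rules stated above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]  # n odd: overall median belongs to both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]      # n even: overall median belongs to neither half
        upper = ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


even_example = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
odd_example = quartiles([2, 4, 6, 8, 10])
print(even_example)
print(odd_example)
```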

Pulse Rates n = 138

Count   Stem   Leaves
          4
   3      4    588
   9      5    001233444
  10      5    5556788899
  23      6    00011111122233333344444
  23      6    55556666667777788888888
  16      7    0000011222233444
  23      7    55555666666777888888999
  10      8    0000112224
  10      8    5555667789
   4      9    0012
   2      9    58
   4     10    0223
         10
   1     11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem   Leaves
   2     22    55
   4     23    57
   6     24    26
   7     25    7
  10     26    257
  12     27    59
  (4)    28    1567
  15     29    35599
  10     30    333
   7     31    45
   5     32    155
   2     33    6
   1     34    0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem   Leaves
   2     22    55
   4     23    57
   6     24    26
   7     25    7
  10     26    257
  12     27    59
  (4)    28    1567
  15     29    35599
  10     30    333
   7     31    45
   5     32    155
   2     33    6
   1     34    0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X data (n = 25): smallest = min = 0.6; Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2; largest = max = 6.1

[Figure: Disease X; vertical axis: years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Disease X (years until death), n = 25:

0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Boxplot display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]
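The 1.5 × IQR outlier rule used in these boxplot examples can be sketched as a small Python function. The name `outlier_fences` and the `k` parameter are illustrative; the quartile values are taken from the slides.

```python
import math


def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78.
pulse_low, pulse_high = outlier_fences(63, 78)

# Disease X: Q1 = 2.3, Q3 = 4.2; check the two largest observations.
disease_low, disease_high = outlier_fences(2.3, 4.2)
is_outlier_79 = 7.9 > disease_high
is_outlier_61 = 6.1 > disease_high

print(pulse_low, pulse_high)
print(round(disease_high, 2), is_outlier_79, is_outlier_61)
```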

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis from 0 to 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival: Alive 710/2201 = 32.3%; Dead 1491/2201 = 67.7%

marg. dist. of class: Crew 885/2201 = 40.2%; First 325/2201 = 14.8%; Second 285/2201 = 12.9%; Third 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528    1491
        % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
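The marginal and conditional distributions, and the three fractions above, differ only in which total goes in the denominator. The Python sketch below makes that explicit using the Titanic counts; the variable names are illustrative.

```python
# Titanic counts by survival (rows) and class (columns), from the table above.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {status: t / grand_total for status, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions: same numerator (202), three different denominators.
frac_survivors_in_first = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
frac_first_and_survived = counts["Alive"]["First"] / grand_total          # 202/2201
frac_first_who_survived = counts["Alive"]["First"] / col_totals["First"]  # 202/325

print({k: f"{v:.1%}" for k, v in marginal_survival.items()})
print({k: f"{v:.1%}" for k, v in marginal_class.items()})
print({k: f"{v:.1%}" for k, v in survival_given_class.items()})
```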

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

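Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$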
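The formula for r can be applied term by term to the fuel consumption data. The Python sketch below standardizes each x and y, multiplies the pairs, and divides the sum by n - 1; the function name `correlation` and the list names are illustrative.

```python
import math


def sample_sd(values, center):
    """Sample standard deviation with divisor n - 1."""
    return math.sqrt(sum((v - center) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sample_sd(xs, x_bar)
    s_y = sample_sd(ys, y_bar)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))
```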

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis: Fouls, 0 to 30; vertical axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

>

Unemployment Rate by Educational Attainment

Water Use During Super Bowl XLV (Packers 31, Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the “middle” of the data is located

variability (next section): measures how “spread out” the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$$
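The sample-mean arithmetic above can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names are hypothetical and not part of the slides.

```python
# Weekly TV viewing times (hours) for the 7 fourth graders in the example.
viewing_hours = [19, 40, 16, 12, 10, 6, 9]

n = len(viewing_hours)       # sample size n
total = sum(viewing_hours)   # sum of y_i for i = 1..n
y_bar = total / n            # sample mean: the sum divided by n

print(n, total, y_bar)       # 7 112 16.0
```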

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{is the population mean}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; horizontal axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5
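The odd/even rule for the median translates directly into code. The function below is a minimal sketch of that rule (the name `median` is chosen here for illustration; Python's standard `statistics.median` gives the same results).

```python
def median(values):
    """Median by the rule above: middle value if n is odd,
    mean of the 2 middle values if n is even."""
    ordered = sorted(values)   # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the 2 middle values


print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```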

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean = 55.26 years, median = 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458
Ordered: 28 32 139 141 175 253 458
m = 141

Example, n = 8: 175 28 32 139 141 253 357 458
Ordered: 28 32 139 141 175 253 357 458
m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation, so $m = 85$

2010, 2014 baseball salaries

2010: n = 845, mean = $3,297,828, median = $1,330,000, max = $33,000,000

2014: n = 848, mean = $3,932,912, median = $1,456,250, max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis Year, 1985 to 2013; left vertical axis Mean, Median Salary, 200,000 to 3,700,000; right vertical axis Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600; bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the “middle” of the data is located

variability: measures how “spread out” the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$
   $y_i - \bar{y}$: deviation of $y_i$ from the mean
   $\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

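Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$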

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations …

Women height (inches)

  i    x_i    x-bar   (x_i - x-bar)   (x_i - x-bar)^2
  1    59     63.4       -4.4             19.0
  2    60     63.4       -3.4             11.3
  3    61     63.4       -2.4              5.6
  4    62     63.4       -1.4              1.8
  5    62     63.4       -1.4              1.8
  6    63     63.4       -0.4              0.1
  7    63     63.4       -0.4              0.1
  8    63     63.4       -0.4              0.1
  9    64     63.4        0.6              0.4
 10    64     63.4        0.6              0.4
 11    65     63.4        1.6              2.7
 12    66     63.4        2.6              7.0
 13    67     63.4        3.6             13.3
 14    68     63.4        4.6             21.6
       Mean 63.4         Sum 0.0          Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance s^2.
2. Then take the square root to get the standard deviation s.

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

[Figure: histogram of the heights with Mean ± 1 s.d. marked]

We’ll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
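As one way of letting software do this calculation, the sketch below repeats the two steps (variance first, then square root) for the 14 heights in the table. It is an illustration only; the variable names are hypothetical, and `statistics.stdev` from the Python standard library would return the same standard deviation.

```python
import math

# Heights (inches) of the 14 women in the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Deviations from the mean always sum to (essentially) zero.
deviations = [x - mean for x in heights]

# Step 1: the variance is the sum of squared deviations divided by n - 1.
sum_sq = sum(d ** 2 for d in deviations)
variance = sum_sq / (n - 1)

# Step 2: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 1), round(sum_sq, 1), round(variance, 2), round(std_dev, 2))
# 63.4 85.2 6.55 2.56
```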

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s^2 is the sample variance; σ^2 is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0. When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$    sample mean
$m$    sample median
$s^2$    sample variance
$s$    sample stand. dev.

POPULATION
$\mu$    population mean
population median
$\sigma^2$    population variance
$\sigma$    population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the Mean and Standard Deviation (numerical) with the Histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

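68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s, 34% on each side of y-bar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s, 47.5% on each side of y-bar]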

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
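The interval counts in the textbook-cost example can be checked with a short script. This is a sketch under one assumption: it uses the mean and standard deviation exactly as reported on the slide (375.48 and 42.72) rather than recomputing s. The helper name `count_within` is hypothetical.

```python
# Textbook costs for the 50 students in the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

Y_BAR = 375.48  # sample mean as reported on the slide
S = 42.72       # sample standard deviation as reported on the slide


def count_within(data, center, spread, k):
    """Count the values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for value in data if low < value < high)


for k in (1, 2, 3):
    count = count_within(costs, Y_BAR, S, k)
    print(f"within {k} sd: {count}/{len(costs)} = {count / len(costs):.0%}")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, 99.7% the rule predicts, which is what "approximately" means for a roughly bell-shaped sample.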

The best estimate of the standard deviation of the men’s weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor’s score 680; SAT mean = 500, sd = 100

ACT Math: Gerald’s score 27; ACT mean = 18, sd = 6

Eleanor’s z-score: z = (680 - 500)/100 = 1.8

Gerald’s z-score: z = (27 - 18)/6 = 1.5

Eleanor’s score is better.
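Both comparisons (the two exams, and SAT versus ACT) apply the same one-line formula. A minimal sketch, with a hypothetical helper named `z_score`:

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales become comparable after standardizing.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)     # 0.5 0.4
print(z_eleanor, z_gerald)  # 1.8 1.5
```

In each pair, the larger z-score marks the score that stands out more relative to its own distribution.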

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC’s z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

(columns: observation number, position counted within its half, value)

  1   1   0.6
  2   2   1.2
  3   3   1.6
  4   4   1.9
  5   5   1.5
  6   6   2.1
  7   7   2.3
  8   6   2.3
  9   5   2.5
 10   4   2.8
 11   3   2.9
 12   2   3.3
 13   1   3.4
 14   2   3.6
 15   3   3.7
 16   4   3.8
 17   5   3.9
 18   6   4.1
 19   7   4.2
 20   6   4.5
 21   5   4.7
 22   4   4.9
 23   3   5.3
 24   2   5.6
 25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

       Q1       M        Q3
 1/4       1/4      1/4      1/4

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
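The quartile rules above can be sketched as a small function. This follows the slide's convention (include the overall median in both halves when n is odd); statistical software often uses other quartile conventions, so its answers may differ slightly for the same data. The function names are hypothetical.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ordered[:half], ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```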

Pulse Rates, n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

        stem  leaf
   2     22   55
   4     23   57
   6     24   26
   7     25   7
  10     26   257
  12     27   59
  (4)    28   1567
  15     29   35599
  10     30   333
   7     31   45
   5     32   155
   2     33   6
   1     34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

        stem  leaf
   2     22   55
   4     23   57
   6     24   26
   7     25   7
  10     26   257
  12     27   59
  (4)    28   1567
  15     29   35599
  10     30   333
   7     31   45
   5     32   155
   2     33   6
   1     34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data, largest (observation 25) to smallest (observation 1): 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: boxplot, Disease X; vertical axis Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 “crush” victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data, largest (observation 25) to smallest (observation 1): 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Boxplot display of 5-number summary

BOXPLOT

[Figure: boxplot, Disease X; vertical axis Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
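The outlier check can be sketched in code. The snippet uses the quartiles and values shown on the slide (Q1 = 2.3, Q3 = 4.2, with 7.9 as the largest value); the lower fence, Q1 - 1.5 IQR, is added here by symmetry with the upper fence, and the variable names are hypothetical.

```python
# Years until death, as listed on the slide (observation 25 is 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2              # quartiles as given on the slide
iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers

outliers = [v for v in years if v > upper_fence or v < lower_fence]

# The upper whisker ends at the largest value that is not an outlier.
whisker_top = max(v for v in years if v <= upper_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```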

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with marked values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel “out of the box” does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                             Class
Survival             Crew   First   Second   Third   Total
Alive   Count         212     202      118     178     710
        % of col     24.0    62.2     41.4    25.2    32.3
Dead    Count         673     123      167     528    1491
        % of col     76.0    37.8     58.6    74.8    67.7
Total   Count         885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the same table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
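The marginal and conditional distributions for the Titanic table can be computed directly from the counts. The following is an illustrative sketch; the dictionary layout and variable names are choices made here, not something the slides prescribe.

```python
# Titanic counts by survival status (rows) and class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

for c in classes:
    print(f"{c}: {marginal_class[c]:.1%} of people, "
          f"{survival_given_class[c]:.1%} survived")
```

Note the different denominators in the three questions on the slide: 710 (survivors), 2201 (everyone), and 325 (first class) each answer a different question about the same count of 202.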

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

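Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$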
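The formula for r can be applied term by term to the car data. The sketch below assumes weights are in 1000s of lbs and fuel consumption in gal/100 miles, with decimal points read as 3.4, 5.5, and so on; under that reading the computed value agrees with the slide's r. Function names are hypothetical.

```python
import math

# (weight in 1000 lbs, fuel consumption in gal/100 miles) for the 10 cars.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


r = correlation(weights, fuel)
print(round(r, 4))  # 0.9766
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight or fuel consumption is re-expressed in different units.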

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third “lurking” variable: playing time.

correlation r = 0.935

End of Chapter 3
Page 45: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Water Use During Super Bowl XLV(Packers 31 Steelers 25)

Heat Maps

Word Wall (customer feedback)

Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{\sum_{i=1}^{n} y_i}{n}$$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9.7

$$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9.7 = 112.7$$

$$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112.7}{7} = 16.1$$
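The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the last observation is 9.7 hours, which is the reading that makes the total 112.7.

```python
# Weekly TV viewing times (hours) for the 7 fourth graders in the example.
# Assumption: the last value is 9.7 (the decimal point was lost in the slide text).
tv_hours = [19, 40, 16, 12, 10, 6, 9.7]

n = len(tv_hours)        # sample size n
total = sum(tv_hours)    # sum of the observations y_1 + ... + y_n
y_bar = total / n        # sample mean

print(f"n = {n}, sum = {total:.1f}, mean = {y_bar:.1f}")
```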

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} \quad \text{is the population mean}$$

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; vertical axis Frequency (0 to 70); horizontal axis bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
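The odd/even rule above can be sketched in Python as follows. The function name `median` is just a label for this illustration; in practice the standard library's `statistics.median` applies the same rule.

```python
def median(values):
    """Median by the rule above: the middle value when n is odd,
    the mean of the two middle values when n is even."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))   # n = 5, odd
print(median([2, 4, 6, 8]))       # n = 4, even
```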

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location = 12th observation; m = 85

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
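To see this sensitivity concretely, the sketch below recomputes the mean and median of the class pulse rates from the earlier example with and without the largest value (140). It uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23, already sorted).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]
without_max = pulses[:-1]   # drop the single largest value, 140

print(f"all 23 values: mean = {mean(pulses):.2f}, median = {median(pulses)}")
print(f"without 140:   mean = {mean(without_max):.2f}, median = {median(without_max)}")
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only half a beat.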

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis Year (1985 to 2013); left axis Mean/Median Salary (200,000 to 3,700,000); right axis Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s); vertical axis Frequency (0 to 600); bar counts 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores (20 to 100); vertical axis Frequency (0 to 30)]

Symmetric data

mean, median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis Number of Customers; vertical axis Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

deviation of $y_i$ from the mean: $y_i - \bar{y}$

sum the deviations of all the $y_i$'s from $\bar{y}$:

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing}$$

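Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$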
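A short Python sketch of the same point: both data sets have deviations that sum to zero, even though their spreads are very different. The function name is illustrative only.

```python
def sum_of_deviations(values):
    """Sum of (value - mean) over the data set; zero for any data
    (up to floating-point rounding)."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)

data_set_1 = [49, 51]
data_set_2 = [0, 100]

# The sums of deviations are identical, so they cannot measure spread ...
print(sum_of_deviations(data_set_1), sum_of_deviations(data_set_2))
# ... even though the ranges (largest - smallest) differ greatly.
print(max(data_set_1) - min(data_set_1), max(data_set_2) - min(data_set_2))
```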

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i    x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
 1    59    63.4       -4.4             19.0
 2    60    63.4       -3.4             11.3
 3    61    63.4       -2.4              5.6
 4    62    63.4       -1.4              1.8
 5    62    63.4       -1.4              1.8
 6    63    63.4       -0.4              0.1
 7    63    63.4       -0.4              0.1
 8    63    63.4       -0.4              0.1
 9    64    63.4        0.6              0.4
10    64    63.4        0.6              0.4
11    65    63.4        1.6              2.7
12    66    63.4        2.6              7.0
13    67    63.4        3.6             13.3
14    68    63.4        4.6             21.6
      Mean 63.4        Sum 0.0          Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
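As one possible software route, the sketch below reproduces the women's-height calculation in Python, first step by step and then with the standard library's `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor. Variable names are illustrative.

```python
import math
from statistics import stdev, variance

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                                # sample mean
squared_devs = sum((x - x_bar) ** 2 for x in heights)   # sum of squared deviations
s2 = squared_devs / (n - 1)                             # step 1: sample variance
s = math.sqrt(s2)                                       # step 2: sample standard deviation

print(f"mean = {x_bar:.1f}, sum of squares = {squared_devs:.1f}")
print(f"s^2 = {s2:.2f} square inches, s = {s:.2f} inches")

# Cross-check against the library routines.
print(math.isclose(s2, variance(heights)), math.isclose(s, stdev(heights)))
```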

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$\mu$ is the population mean.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample standard deviation

POPULATION

$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population standard deviation

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) combined with Histogram (graphical): the 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

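68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-bar - s and y-bar + s, 34% on each side of y-bar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-bar - 2s and y-bar + 2s, 47.5% on each side of y-bar]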

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:

$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:

$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:

$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
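The three interval counts in the textbook-cost example can be reproduced with a short Python sketch. It recomputes the mean and standard deviation from the 50 listed costs using the standard `statistics` module; variable names are illustrative.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)       # sample standard deviation (n - 1 divisor)

counts = []
for k in (1, 2, 3):
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low < c < high for c in costs)   # values within k standard deviations
    counts.append(inside)
    print(f"within {k} sd: ({low:.2f}, {high:.2f}) "
          f"holds {inside}/{len(costs)} = {100 * inside / len(costs):.0f}%")
```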

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
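The comparison can be written as a small Python function. This is a sketch; the function and variable names are made up for the illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math

print(f"Eleanor z = {eleanor}, Gerald z = {gerald}")
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.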

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47
                        Sum = 0   Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Obs   Count   Value
  1     1      0.6
  2     2      1.2
  3     3      1.6
  4     4      1.9
  5     5      1.5
  6     6      2.1
  7     7      2.3
  8     6      2.3
  9     5      2.5
 10     4      2.8
 11     3      2.9
 12     2      3.3
 13     1      3.4
 14     2      3.6
 15     3      3.7
 16     4      3.8
 17     5      3.9
 18     6      4.1
 19     7      4.2
 20     6      4.5
 21     5      4.7
 22     4      4.9
 23     3      5.3
 24     2      5.6
 25     1      6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
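The quartile rules above can be sketched in Python as follows. The function names are illustrative. Note that this follows the slide's convention (overall median included in both halves when n is odd); library routines such as `statistics.quantiles` use other interpolation conventions by default and can give slightly different quartiles.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rules stated above."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```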

Pulse Rates, n = 138

Count   Stem   Leaves
          4
   3      4    588
   9      5    001233444
  10      5    5556788899
  23      6    00011111122233333344444
  23      6    55556666667777788888888
  16      7    0000011222233444
  23      7    55555666666777888888999
  10      8    0000112224
  10      8    5555667789
   4      9    0012
   2      9    58
   4     10    0223
         10
   1     11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
 (4)     28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
 (4)     28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Observations 25 down to 1: 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: Disease X; vertical axis Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

[Figure: boxplot]

Boxplot: display of 5-number summary

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Observations 25 down to 1: 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: boxplot, Disease X; vertical axis Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
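The outlier check above can be sketched in Python. It assumes the Disease X values are years with one decimal place (0.6 up to 7.9) and takes Q1 = 2.3 and Q3 = 4.2 as given; the function name is illustrative.

```python
def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Disease X data (years until death); decimals are an assumption (see lead-in).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower, upper = outlier_fences(q1=2.3, q3=4.2)
outliers = [y for y in years if y < lower or y > upper]
whisker_top = max(y for y in years if y <= upper)   # where the upper whisker ends

print(f"fences: ({lower:.2f}, {upper:.2f})")
print(f"outliers: {outliers}, upper whisker ends at {whisker_top}")
```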

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with labels 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
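The marginal distributions can be computed from the table counts with a short Python sketch. The dictionary layout and variable names are choices made for this illustration.

```python
# Titanic counts: rows are survival status, columns are class.
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in titanic.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {status: sum(row.values()) / grand_total
                     for status, row in titanic.items()}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {c: sum(titanic[s][c] for s in titanic) / grand_total
                  for c in classes}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```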

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                      Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (answers refer to the table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
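The three questions differ only in the denominator. A minimal Python sketch using the Titanic counts (variable names are illustrative):

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())                 # all survivors
grand_total = total_alive + sum(dead.values())    # everyone on board
first_total = alive["First"] + dead["First"]      # all first-class passengers

# Same numerator (first-class survivors), three different denominators.
frac_survivors_in_first = alive["First"] / total_alive   # 202/710
frac_first_and_survived = alive["First"] / grand_total   # 202/2201
frac_first_who_survived = alive["First"] / first_total   # 202/325

# Conditional distribution of survival given class (the "% of col" row).
survival_by_class = {c: alive[c] / (alive[c] + dead[c]) for c in alive}

print(f"{frac_survivors_in_first:.3f} {frac_first_and_survived:.3f} "
      f"{frac_first_who_survived:.3f}")
for c, p in survival_by_class.items():
    print(f"P(alive | {c}) = {100 * p:.1f}%")
```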

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

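Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$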
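The formula for r can be applied directly to the ten (car weight, fuel consumption) pairs. The sketch below assumes the pairs read as weights in 1000 lbs and fuel consumption in gal/100 miles with one decimal place, e.g. (3.4, 5.5); the function name `correlation` is illustrative.

```python
from statistics import mean, stdev

# Assumed reading of the ten data pairs (decimal points restored).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, gal/100 miles

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```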

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; horizontal axis Fouls (0 to 30); vertical axis Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Heat Maps

Word Wall (customer feedback)

Section 32Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability (next section)

measures how ldquospread outrdquo the data is

Notation for Data Valuesand Sample Mean

1 2

1 2

3

The sample size is denoted by

For a variable denoted by its observations are denoted by

A common measure of center is the sample mean

The sample mean is denoted by

Shorte

n

n

y y yy

n

y

y y y y

y

n

1 21

1

ned expression for using the symbol

(uppercase Greek letter sigma)n

n

i

i n

i

i

y

y y y

yy

n

y

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

1

population

population mea

Denoted by the Greek letter

is the size (for example =34000 for NCSU)

the value of is typically not known

we often use the sample mean

to estimat

n

e the unknown

N

ii

y

N N

y

N

value of

Connection Between Mean and Histogram

A histogram balances when supported at the mean Mean x = 1406

Histogram

0

10

20

30

40

50

60

70

118

5

125

5

132

5

139

5

146

5

153

5

16

05

Mo

re

Absences f rom Work

Fre

qu

en

cy

Frequency

The median anothermeasure of center

Given a set of n data values arranged in order of magnitude

Median= middle value n odd

mean of 2 middle values n even

Ex 2 4 6 8 10 n=5 median=6 Ex 2 4 6 8 n=4 median=(4+6)2=5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458

m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1) 5245

2) 4965.5

3) 4960

4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1) 5245

2) 4965.5

3) 5546

4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location: 12th observation; $m = 85$

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
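To see this sensitivity concretely, the sketch below reuses the class pulse rates from the earlier example and compares the mean and median with and without the largest observation (140). Dropping that value is a hypothetical step added here for illustration; it is not something the slides do.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Hypothetical comparison: remove the single largest observation.
without_max = sorted(pulses)[:-1]

print(round(mean(pulses), 2), median(pulses))            # 84.48 85
print(round(mean(without_max), 2), median(without_max))  # 81.95 84.5
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.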

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis Year; one vertical axis Mean/Median Salary ($200,000 to $3,700,000), the other Maximum Salary ($0 to $35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; horizontal axis Salary ($1000s); vertical axis Frequency, 0 to 600]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis Exam Scores, 20 to 100; vertical axis Frequency, 0 to 30]

Symmetric data

mean, median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; horizontal axis Number of Customers; vertical axis Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

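Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$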
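A short Python sketch of this example shows why the plain sum of deviations cannot measure spread: both data sets give 0, even though one is far more spread out than the other. The helper name `sum_of_deviations` is made up for this illustration.

```python
def sum_of_deviations(values):
    """Sum of (y_i - mean) over all observations."""
    y_bar = sum(values) / len(values)
    return sum(y - y_bar for y in values)


data_set_1 = [49, 51]
data_set_2 = [0, 100]

# Both sums are 0, although the ranges (2 versus 100) differ greatly.
print(sum_of_deviations(data_set_1), max(data_set_1) - min(data_set_1))
print(sum_of_deviations(data_set_2), max(data_set_2) - min(data_set_2))
```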

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women height (inches)

i     $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
1     59      63.4        -4.4                19.0
2     60      63.4        -3.4                11.3
3     61      63.4        -2.4                5.6
4     62      63.4        -1.4                1.8
5     62      63.4        -1.4                1.8
6     63      63.4        -0.4                0.1
7     63      63.4        -0.4                0.1
8     63      63.4        -0.4                0.1
9     64      63.4        0.6                 0.4
10    64      63.4        0.6                 0.4
11    65      63.4        1.6                 2.7
12    66      63.4        2.6                 7.0
13    67      63.4        3.6                 13.3
14    68      63.4        4.6                 21.6

Mean 63.4; Sum of deviations 0.0; Sum of squared deviations 85.2

$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
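As one example of "other software", the sketch below uses Python's standard `statistics` module on the 14 women's heights from the table. Python is an assumption here (the slides name only calculators, Excel, and Statcrunch); `statistics.variance` and `statistics.stdev` both use the n - 1 divisor.

```python
from statistics import mean, stdev, variance

# Heights (inches) of the 14 women in the worked example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)    # sample mean
s2 = variance(heights)   # sample variance (divides by n - 1)
s = stdev(heights)       # sample standard deviation

print(round(x_bar, 1), round(s2, 2), round(s, 2))  # 63.4 6.55 2.56
```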

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$\mu$ is the population mean.

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons?

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

$m$: population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

[Diagram: Mean and Standard Deviation (numerical); 68-95-99.7 rule; Histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

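68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]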

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: $\frac{50}{50} = 100\%$; 68-95-99.7 rule: 99.7%
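The three interval counts can be checked with a short Python sketch. It takes the mean (375.48) and standard deviation (42.72) as given on the slides rather than recomputing them; the function name `count_within` is an assumption made for this illustration.

```python
# The 50 textbook costs from the example.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR = 375.48  # sample mean, as stated on the slide
S = 42.72       # sample standard deviation, as stated on the slide


def count_within(values, center, spread, k):
    """Count values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low < v < high)


for k in (1, 2, 3):
    inside = count_within(costs, Y_BAR, S, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
# 1 32 64%
# 2 48 96%
# 3 50 100%
```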

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10

2) 15

3) 20

4) 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2
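The exam comparison can be reproduced with a one-line z-score function. This is a minimal sketch; the name `z_score` is chosen for this illustration.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - y_bar) / s


z1 = z_score(91, 88, 6)    # exam 1
z2 = z_score(92, 88, 10)   # exam 2
print(z1, z2)              # 0.5 0.4
print("exam 1 is better" if z1 > z2 else "exam 2 is better")
```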

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5      6.4        1.79
UVA          13.1      4.0        1.12
Louisville   10.9      1.8        0.50
UNC          9.2       0.1        0.03
VaTech       7.9       -1.2       -0.34
FSU          7.9       -1.2       -0.34
GaTech       7.1       -2.0       -0.56
NCSU         6.5       -2.6       -0.73
Clemson      3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0
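The claim that z-scores add to zero can be checked numerically on the support figures from the table. The sketch below uses Python's standard `statistics` module; the variable names are assumptions made for this illustration, and the sum is compared with zero up to floating-point rounding.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools, from the table.
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = mean(support)
s = stdev(support)  # sample standard deviation (n - 1 divisor)
z_scores = [(y - y_bar) / s for y in support]

print(round(y_bar, 4), round(s, 4))  # 9.1 3.5697
print(abs(sum(z_scores)) < 1e-9)     # True: the z-scores sum to zero
```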

Recently the mean tuition at 4-yr public colleges/universities in the US was $6,185, with a standard deviation of $1,804. In NC the mean tuition was $4,320. What is NC's z-score?

1) 1.03

2) -1.03

3) 2.39

4) 1865

5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

i     position   value
1     1          0.6
2     2          1.2
3     3          1.6
4     4          1.9
5     5          1.5
6     6          2.1
7     7          2.3
8     6          2.3
9     5          2.5
10    4          2.8
11    3          2.9
12    2          3.3
13    1          3.4
14    2          3.6
15    3          3.7
16    4          3.8
17    5          3.9
18    6          4.1
19    7          4.2
20    6          4.5
21    5          4.7
22    4          4.9
23    3          5.3
24    2          5.6
25    1          6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4, 1/4, 1/4, 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2, 4, 6, 8, 10

Q1 = 6

Q3: median of upper half 12, 14, 16, 18, 20

Q3 = 16
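The quartile rules above can be sketched as a small Python function. Note that this follows the rule stated on these slides (include the overall median in both halves when n is odd); calculators and software packages often use other quartile conventions and may give slightly different answers. The function names are chosen for this illustration.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the halves split cleanly
        lower, upper = ordered[:half], ordered[half:]
    return _median(lower), _median(ordered), _median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```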

Pulse Rates, n = 138

Count   Stem   Leaves
        4
3       4      588
9       5      001233444
10      5      5556788899
23      6      00011111122233333344444
23      6      55556666667777788888888
16      7      0000011222233444
23      7      55555666666777888888999
10      8      0000112224
10      8      5555667789
4       9      0012
2       9      58
4       10     0223
        10
1       11     1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem   Leaf
2       22     55
4       23     57
6       24     26
7       25     7
10      26     257
12      27     59
(4)     28     1567
15      29     35599
10      30     333
7       31     45
5       32     155
2       33     6
1       34     0

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem   Leaf
2       22     55
4       23     57
6       24     26
7       25     7
10      26     257
12      27     59
(4)     28     1567
15      29     35599
10      30     333
7       31     45
5       32     155
2       33     6
1       34     0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

i     position   value
25    1          6.1
24    2          5.6
23    3          5.3
22    4          4.9
21    5          4.7
20    6          4.5
19    7          4.2
18    6          4.1
17    5          3.9
16    4          3.8
15    3          3.7
14    2          3.6
13    1          3.4
12    2          3.3
11    3          2.9
10    4          2.8
9     5          2.5
8     6          2.3
7     7          2.3
6     6          2.1
5     5          1.5
4     4          1.9
3     3          1.6
2     2          1.2
1     1          0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for Disease X; vertical axis Years until death, 0 to 7]

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

i     position   value
25    1          7.9
24    2          6.1
23    3          5.3
22    4          4.9
21    5          4.7
20    6          4.5
19    7          4.2
18    6          4.1
17    5          3.9
16    4          3.8
15    3          3.7
14    2          3.6
13    1          3.4
12    2          3.3
11    3          2.9
10    4          2.8
9     5          2.5
8     6          2.3
7     7          2.3
6     6          2.1
5     5          1.5
4     4          1.9
3     3          1.6
2     2          1.2
1     1          0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
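The outlier check can be sketched in Python using the quartiles given on the slide (Q1 = 2.3, Q3 = 4.2). The lower fence, Q1 - 1.5 × IQR, is included by analogy with the pulse-rate example that follows; the function name `fences` and the variable names are assumptions made for this illustration.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


# Years until death for the 25 individuals, largest value 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower, upper = fences(2.3, 4.2)  # quartiles as given on the slide
outliers = [y for y in years if y < lower or y > upper]
whisker_top = max(y for y in years if y <= upper)

print(round(upper, 2), outliers, whisker_top)  # 7.05 [7.9] 6.1
```

Applied to the pulse data (Q1 = 63, Q3 = 78), the same function returns the fences 40.5 and 100.5.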

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78 and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew   First   Second   Third   Total
Alive   212    202     118      178     710
Dead    673    123     167      528     1491
Total   885    325     285      706     2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
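The marginal distributions can be computed from the table of counts with a few lines of Python. This is a sketch; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Titanic counts by survival (rows) and class (columns), from the table.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals as a percent of the grand total.
survival_marginal = {
    status: 100 * sum(row.values()) / grand_total
    for status, row in counts.items()
}

# Marginal distribution of class: column totals as a percent of the grand total.
class_marginal = {
    cls: 100 * sum(row[cls] for row in counts.values()) / grand_total
    for cls in counts["Alive"]
}

print({k: round(v, 1) for k, v in survival_marginal.items()})
print({k: round(v, 1) for k, v in class_marginal.items()})
```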

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival                Crew     First    Second   Third    Total
Alive    Count          212      202      118      178      710
         % of col       24.0%    62.2%    41.4%    25.2%    32.3%
Dead     Count          673      123      167      528      1491
         % of col       76.0%    37.8%    58.6%    74.8%    67.7%
Total    Count          885      325      285      706      2201
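The "% of col" rows are conditional distributions: each count is divided by its own column total rather than by the grand total. A minimal Python sketch, with variable names assumed for this illustration:

```python
# Titanic counts by class, from the table.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Conditional distribution of survival given class:
# divide each count by the total for that class (its column total).
pct_alive_given_class = {
    cls: 100 * alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}
pct_dead_given_class = {
    cls: 100 - pct for cls, pct in pct_alive_given_class.items()
}

print({k: round(v, 1) for k, v in pct_alive_given_class.items()})
print({k: round(v, 1) for k, v in pct_dead_given_class.items()})
```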

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 80

2) 235

3) 582

4) 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 418

2) 388

3) 512

4) 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 452

2) 488

3) 268

4) 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

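Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$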
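The correlation formula can be applied directly to the ten (weight, fuel consumption) pairs. The sketch below standardizes each x and y, multiplies the pairs, and divides the total by n - 1; it uses Python's standard `statistics` module, and the function name `correlation` is chosen for this illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```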

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage)

Improper training? Will no firemen present result in the least amount of damage?

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

Notation for Data Valuesand Sample Mean

1 2

1 2

3

The sample size is denoted by

For a variable denoted by its observations are denoted by

A common measure of center is the sample mean

The sample mean is denoted by

Shorte

n

n

y y yy

n

y

y y y y

y

n

1 21

1

ned expression for using the symbol

(uppercase Greek letter sigma)n

n

i

i n

i

i

y

y y y

yy

n

y

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders

19 40 16 12 10 6 and 97

1

7

1

19 40 16 12 10 6 9 112

11216

7 7

ii

ii

y

yy

Population Mean

1

population

population mea

Denoted by the Greek letter

is the size (for example =34000 for NCSU)

the value of is typically not known

we often use the sample mean

to estimat

n

e the unknown

N

ii

y

N N

y

N

value of

Connection Between Mean and Histogram

A histogram balances when supported at the mean Mean x = 1406

Histogram

0

10

20

30

40

50

60

70

118

5

125

5

132

5

139

5

146

5

153

5

16

05

Mo

re

Absences f rom Work

Fre

qu

en

cy

Frequency

The median anothermeasure of center

Given a set of n data values arranged in order of magnitude

Median= middle value n odd

mean of 2 middle values n even

Ex 2 4 6 8 10 n=5 median=6 Ex 2 4 6 8 n=4 median=(4+6)2=5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples Example n = 7

175 28 32 139 141 253 458 Example n = 7 (ordered) 28 32 139 141 175 253 458 Example n = 8

175 28 32 139 141 253 357 458

Example n =8 (ordered)

28 32 139 141 175 253 357 458

m = 141

m = (141+175)2 = 158

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496049604971524555467586

1 5245

2 49655

3 4960

4 4971

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496052455546497155877586

1 5245

2 49655

3 5546

4 4971

Properties of Mean Median1The mean and median are unique that is a

data set has only 1 mean and 1 median (the mean and median are not necessarily equal)

2The mean uses the value of every number in the data set the median does not

14

20 4 6Ex 2 4 6 8 5 5

4 2

21 4 6Ex 2 4 6 9 5 5

4 2

x m

x m

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ sample standard deviation

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women's height (inches)

 i    $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
 1    59    63.4    -4.4    19.0
 2    60    63.4    -3.4    11.3
 3    61    63.4    -2.4     5.6
 4    62    63.4    -1.4     1.8
 5    62    63.4    -1.4     1.8
 6    63    63.4    -0.4     0.1
 7    63    63.4    -0.4     0.1
 8    63    63.4    -0.4     0.1
 9    64    63.4     0.6     0.4
10    64    63.4     0.6     0.4
11    65    63.4     1.6     2.7
12    66    63.4     2.6     7.0
13    67    63.4     3.6    13.3
14    68    63.4     4.6    21.6
Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

s = standard deviation = $\sqrt{6.55}$ = 2.56 inches

1. First calculate the variance $s^2$: $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation s: $s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure label: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
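As one possible way to do this with software, the sketch below reproduces the height calculation using Python's standard `statistics` module. The variable names are illustrative and not part of the slides; the data are the 14 heights from the table.

```python
import statistics

# Heights (inches) of the 14 women in the table
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = statistics.fmean(heights)

# Sum of squared deviations from the mean
sum_sq_dev = sum((x - mean) ** 2 for x in heights)

variance = sum_sq_dev / (n - 1)   # sample variance s^2, with df = n - 1
std_dev = variance ** 0.5         # sample standard deviation s

# The library functions use the same n - 1 divisor
assert abs(variance - statistics.variance(heights)) < 1e-9
assert abs(std_dev - statistics.stdev(heights)) < 1e-9

print(f"mean = {mean:.1f}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```

The by-hand steps and the library calls agree because `statistics.variance` and `statistics.stdev` compute the sample (n - 1) versions; `statistics.pvariance` and `statistics.pstdev` divide by the number of values instead.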

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ population standard deviation

N is the population size (for example, N = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use s to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and $\sigma$ are always greater than or equal to zero.

3. The larger the value of s (or $\sigma$), the greater the spread of the data.

When does s = 0? When does $\sigma$ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and $\sigma$

s and $\sigma$ are always greater than or equal to 0. When does s = 0? $\sigma$ = 0?

The larger the value of s (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$ sample mean
m sample median
$s^2$ sample variance
s sample stand. dev.

POPULATION
$\mu$ population mean
m population median
$\sigma^2$ population variance
$\sigma$ population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) combined with Histogram (graphical) gives the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

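68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]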

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
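The interval counts in the textbook example can be checked with a short script. This is a minimal sketch: the helper `share_within` is a hypothetical name, and the mean and standard deviation are taken as reported on the slides rather than recomputed.

```python
# The 50 textbook costs from the example
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

Y_BAR = 375.48  # sample mean as reported on the slides
S = 42.72       # sample standard deviation as reported on the slides


def share_within(data, center, spread, k):
    """Return (count, percent) of values inside center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    count = sum(low < x < high for x in data)
    return count, 100 * count / len(data)


for k in (1, 2, 3):
    count, percent = share_within(costs, Y_BAR, S, k)
    print(f"within {k} s of the mean: {count} of {len(costs)} = {percent:.0f}%")
```

Comparing the printed percentages with 68, 95, and 99.7 is a quick informal check of whether a data set behaves like a bell-shaped distribution.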

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

y = original data value

$\bar{y}$ = the sample mean

s = the sample standard deviation

z = the z-score corresponding to y

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
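The exam and SAT/ACT comparisons both come down to the same one-line formula. The sketch below is illustrative only; the function name `z_score` is an assumption, and the numbers are the ones given on the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam comparison: same mean, different spread
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

# SAT Math versus ACT Math: different scales entirely
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

print(f"exam 1 z = {exam1:.1f}, exam 2 z = {exam2:.1f}")
print(f"Eleanor z = {eleanor:.1f}, Gerald z = {gerald:.1f}")
```

Because a z-score has no units, values measured on different scales can be compared directly: the larger z-score is the better relative standing.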

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   $y - \bar{y}$   Z-score
Maryland      15.5       6.4      1.79
UVA           13.1       4.0      1.12
Louisville    10.9       1.8      0.50
UNC            9.2       0.1      0.03
VaTech         7.9      -1.2     -0.34
FSU            7.9      -1.2     -0.34
GaTech         7.1      -2.0     -0.56
NCSU           6.5      -2.6     -0.73
Clemson        3.8      -5.3     -1.47

Mean = 9.1000, s = 3.5697

Sum of $y - \bar{y}$ = 0; sum of z-scores = 0
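The claim that z-scores add to zero can be checked numerically. The following sketch assumes the support amounts as printed in the table (rounded to one decimal), so individual z-scores may differ from the slide in the last digit if the slide used unrounded amounts.

```python
import statistics

# Support in $ millions, as listed in the table
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
mean = statistics.fmean(values)
s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)

z_scores = {school: (y - mean) / s for school, y in support.items()}

print(f"mean = {mean:.4f}, s = {s:.4f}")
for school, z in z_scores.items():
    print(f"{school:<11}{z:6.2f}")
print(f"sum of z-scores = {sum(z_scores.values()):.10f}")
```

The sum is zero (up to floating-point rounding) because the deviations from the mean always sum to zero, and dividing every deviation by the same s does not change that.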

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data, positions 1 to 25:
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
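The quartile rules above translate directly into code. This is a sketch of the halves rule as stated on the slide, with a hypothetical function name `quartiles`; statistical packages often use other quartile conventions and may return slightly different values.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using the 'median of each half' rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = values[: half + 1], values[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = values[:half], values[half:]
    return (
        statistics.median(lower),
        statistics.median(values),
        statistics.median(upper),
    )


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}")
```

For the ten values in the example this gives Q1 = 6, a median of 11, and Q3 = 16, matching the hand calculation.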

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

[Figure: Disease X data (25 values from 0.6 to 6.1) plotted on a vertical axis, Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data, positions 1 to 25:
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates, labeled at 40.5, 45, 63, 70, 78, and 100.5]
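The 1.5 × IQR calculation used for the boxplot whiskers can be written as a small helper. This is a sketch under the slide's rule; the function name `outlier_fences` is hypothetical, and the inputs are the quartiles already computed on the slides.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78
low, high = outlier_fences(63, 78)
print(f"fences: ({low}, {high})")

# Check the 5-number summary values against the fences
pulse_summary = [45, 63, 70, 78, 111]
outliers = [x for x in pulse_summary if x < low or x > high]
print(f"flagged as outliers: {outliers}")
```

Values beyond either fence are drawn as individual points, and each whisker stops at the most extreme data value that is still inside its fence.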

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive 710/2201 = 32.3%
Dead 1491/2201 = 67.7%

marg. dist. of class:
Crew 885/2201 = 40.2%
First 325/2201 = 14.8%
Second 285/2201 = 12.9%
Third 706/2201 = 32.1%

[Figure: Marginal distribution of class, bar chart]

[Figure: Marginal distribution of class, pie chart]

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions

Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col.    24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col.    76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

[Figure: Conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic survival-by-class table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
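The marginal and conditional distributions for the Titanic table can be computed from the counts alone. The sketch below is one way to organize the arithmetic; the dictionary layout and variable names are assumptions made for illustration.

```python
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

# Column totals (by class), row totals (by survival), and grand total
class_totals = [a + d for a, d in zip(counts["Alive"], counts["Dead"])]
survival_totals = {status: sum(row) for status, row in counts.items()}
grand_total = sum(class_totals)

# Marginal distributions: each total as a percent of all 2201 people
marginal_survival = {s: 100 * t / grand_total for s, t in survival_totals.items()}
marginal_class = {c: 100 * t / grand_total for c, t in zip(classes, class_totals)}

# Conditional distribution of survival given class: percent of each column
pct_alive_given_class = {
    c: 100 * alive / total
    for c, alive, total in zip(classes, counts["Alive"], class_totals)
}

# The three questions differ only in the denominator
first_alive = counts["Alive"][classes.index("First")]
q1 = 100 * first_alive / survival_totals["Alive"]            # of survivors
q2 = 100 * first_alive / grand_total                         # of everyone
q3 = 100 * first_alive / class_totals[classes.index("First")]  # of first class

print(f"{q1:.1f}% of survivors were in first class")
print(f"{q2:.1f}% of all on board were first-class survivors")
print(f"{q3:.1f}% of first-class passengers survived")
```

The numerator is 202 in all three questions; what changes is the group being conditioned on, which is the point of distinguishing marginal, joint, and conditional percentages.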

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

[Figure: scatterplot of BAC vs number of beers for the 16 students]

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

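Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$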
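The value of r for the fuel consumption data can be reproduced from the formula. The sketch below applies the definition term by term, using the ten (weight, fuel consumption) pairs from the scatterplot slide; the function name `correlation` is illustrative.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(x, y):
    """r = (1/(n-1)) * sum of products of the standardized x and y values."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r is positive when cars that are heavier than average also tend to use more fuel than average.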

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: r is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire (more firemen, more damage).

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


Section 3.2 Describing the Center of Data

Mean

Median

2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by n

For a variable denoted by y, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$

A common measure of center is the sample mean

The sample mean is denoted by $\bar{y}$

$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time, in hours, of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$
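The sample mean calculation maps onto two lines of code: a sum and a division by n. The sketch below uses the TV viewing times from the example; the variable names are illustrative.

```python
# Weekly TV viewing time (hours) of the 7 fourth graders
hours = [19, 40, 16, 12, 10, 6, 9]

n = len(hours)
total = sum(hours)      # the sum of y_i for i = 1..n
y_bar = total / n       # the sample mean

print(f"n = {n}, sum = {total}, mean = {y_bar}")
```

The built-in `sum` plays the role of the sigma notation: it adds up every observation before the division by the sample size.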

Population Mean

Denoted by the Greek letter $\mu$

$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$ population mean

N is the population size (for example, N = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: Histogram of Absences from Work. Bins labeled 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More. Vertical axis: Frequency, 0 to 70.]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n = 62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
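The odd/even rule for the median is short enough to write out directly. The following is a sketch of that rule with a hypothetical function name `median`, applied to the 62 student pulse rates; Python's `statistics.median` follows the same rule and is used here only as a cross-check.

```python
import statistics

pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]


def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


m = median(pulses)
assert m == statistics.median(pulses)
print(f"n = {len(pulses)}, median = {m}")
```

With n = 62 the two middle values are the 31st and 32nd ordered observations, 75 and 76, which is where the 75.5 comes from.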

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7

17.5 2.8 3.2 13.9 14.1 25.3 45.8

Example: n = 7 (ordered)

2.8 3.2 13.9 14.1 17.5 25.3 45.8

m = 14.1

Example: n = 8

17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8

Example: n = 8 (ordered)

2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

m = (14.1+17.5)/2 = 15.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location is the 12th observation; m = 85

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
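The sensitivity of the mean to a single extreme value can be seen with the class pulse rates from the earlier example, which include one reading of 140. The sketch below compares the mean and median with and without that value; dropping the observation is done only for illustration and is not a recommendation from the slides.

```python
import statistics

# Class pulse rates (n = 23), including one unusually high value of 140
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the extreme value left out
trimmed = [x for x in pulses if x != 140]

for label, data in (("with 140", pulses), ("without 140", trimmed)):
    mean = statistics.fmean(data)
    median = statistics.median(data)
    print(f"{label:>12}: mean = {mean:.2f}, median = {median}")
```

Removing one observation moves the mean by about 2.5 beats per minute but moves the median by only half a beat, which is why the median is often preferred for skewed data such as salaries.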

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

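68-95-99.7 rule: 68% within 1 stan dev of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y}-s$ and $\bar{y}+s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan dev of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y}-2s$ and $\bar{y}+2s$, 47.5% on each side of $\bar{y}$]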

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y}-s,\ \bar{y}+s) = (332.76,\ 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y}-2s,\ \bar{y}+2s) = (290.04,\ 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y}-3s,\ \bar{y}+3s) = (247.32,\ 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
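The interval counts in the textbook-cost example can be checked with a few lines of Python. This is only a sketch: the helper name `count_within` is illustrative, and the standard deviation is taken as the value stated on the slide (42.72) rather than recomputed.

```python
# Textbook costs from the example (n = 50)
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

Y_BAR = sum(costs) / len(costs)  # sample mean of the 50 costs
S = 42.72                        # assumption: s as stated on the slide


def count_within(data, center, spread, k):
    """Count values strictly inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo < v < hi for v in data)


# Counts and percentages for the 1, 2 and 3 standard deviation intervals
counts = [count_within(costs, Y_BAR, S, k) for k in (1, 2, 3)]
percents = [100 * c / len(costs) for c in counts]

for k, c, p in zip((1, 2, 3), counts, percents):
    print(f"within {k} s of the mean: {c} of {len(costs)} values ({p:.0f}%)")
```

The observed percentages are close to, but not exactly, 68, 95 and 99.7, which is what the rule promises for approximately bell-shaped data.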

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y: $z = \dfrac{y-\bar{y}}{s}$

where
y = original data value
$\bar{y}$ = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91-88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92-88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better
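The two comparisons above (exam scores, SAT vs ACT) use the same standardizing step, which can be sketched in Python as follows. The function name `z_score` is illustrative, not from the slides.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam comparison: same mean, different spread
z_exam1 = z_score(91, 88, 6)    # 3/6
z_exam2 = z_score(92, 88, 10)   # 4/10

# SAT vs ACT comparison: different scales
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(f"exam 1: z = {z_exam1:.1f}, exam 2: z = {z_exam2:.1f}")
print(f"Eleanor (SAT): z = {z_eleanor:.1f}, Gerald (ACT): z = {z_gerald:.1f}")
```

Because z-scores are unit-free, the larger z identifies the better relative score even when the raw scores are on different scales.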

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
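A short Python sketch of why the z-score column sums to zero, using the support figures as they appear in the table (rounded to one decimal, so individual z-scores may differ in the last digit from those shown). Variable names are illustrative.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools, as listed in the table
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = mean(support)   # sample mean
s = stdev(support)      # sample standard deviation (divides by n - 1)

deviations = [y - y_bar for y in support]
z_scores = [d / s for d in deviations]

# Deviations from the mean always cancel, so the z-scores do too
print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(deviations):.10f}")
print(f"sum of z-scores   = {sum(z_scores):.10f}")
```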

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
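The quartile rule above can be sketched in Python as follows. The function name `quartiles` is illustrative; note that software packages often use other quartile conventions, so results from a calculator or spreadsheet may differ slightly from this rule.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the rule stated on the slide:
    n odd  -> include the overall median in both halves
    n even -> split the sorted data into two equal halves
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        lower, upper = xs[:half + 1], xs[half:]   # middle value goes in both halves
    else:
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)


# Even n: the slide's example
even_result = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Odd n: a small made-up illustration (not from the slides)
odd_result = quartiles([1, 2, 3, 4, 5])

print(even_result)
print(odd_result)
```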

Pulse Rates n = 138

      Stem  Leaves
       4
 3     4    588
 9     5    001233444
10     5    5556788899
23     6    00011111122233333344444
23     6    55556666667777788888888
16     7    0000011222233444
23     7    55555666666777888888999
10     8    0000112224
10     8    5555667789
 4     9    0012
 2     9    58
 4    10    0223
      10
 1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: Disease X, years until death (axis 0 to 7), with the five-number summary marked]

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Boxplot display of 5-number summary

[Figure: boxplot of Disease X, years until death (axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
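The 1.5 × IQR outlier check for the Disease X data can be sketched in Python. The quartiles are taken as given on the slide (Q1 = 2.3, Q3 = 4.2); the variable names are illustrative, and the lower fence is included for completeness even though the slide only works the upper one.

```python
# Years until death for the 25 individuals (the version of the data with 7.9)
years = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8,
    2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5,
    4.7, 4.9, 5.3, 6.1, 7.9,
]

Q1, Q3 = 2.3, 4.2          # quartiles as stated on the slide
iqr = Q3 - Q1              # interquartile range
step = 1.5 * iqr           # 1.5 IQR

lower_fence = Q1 - step
upper_fence = Q3 + step

# Values beyond a fence are flagged as outliers
outliers = [v for v in years if v < lower_fence or v > upper_fence]

# The upper whisker ends at the largest value that is not an outlier
whisker_top = max(v for v in years if v <= upper_fence)

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print(f"outliers: {outliers}, upper whisker ends at {whisker_top}")
```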

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78 and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew   First   Second   Third   Total
Alive    212     202      118     178     710
Dead     673     123      167     528    1491
Total    885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival               Crew    First   Second   Third   Total
Alive   Count           212      202      118     178     710
        % of col       24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count           673      123      167     528    1491
        % of col       76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count           885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic survival-by-class table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
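The marginal and conditional calculations for the Titanic table can be sketched in Python. The dictionary layout and variable names are illustrative choices, not part of the slides; the counts are those in the table.

```python
# Titanic counts by survival (rows) and class (columns)
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(cells.values()) for status, cells in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions from the slide, each with a different denominator
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]   # 202/710
first_and_survivor = counts["Alive"]["First"] / grand_total              # 202/2201
survived_among_first = counts["Alive"]["First"] / col_totals["First"]    # 202/325

print({c: f"{100 * p:.1f}%" for c, p in survival_given_class.items()})
```

The three questions share the numerator 202; only the denominator changes, which is what distinguishes a joint fraction from a conditional one.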

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students listed above]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n-1}\sum_{i=1}^{n}\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1}\sum_{i=1}^{n}\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$
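The formula for r can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. The following Python sketch uses the standard library only; the function name `correlation` is illustrative.

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so pairs where x and y are both above (or both below) their means push r toward +1.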

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30); labeled points (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability (next section): measures how "spread out" the data is

Notation for Data Values and Sample Mean

The sample size is denoted by n.

For a variable denoted by y, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$, so $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders: 19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \dfrac{\sum_{i=1}^{7} y_i}{7} = \dfrac{112}{7} = 16$
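The same calculation as a minimal Python sketch (variable names are illustrative):

```python
# Weekly TV viewing time in hours of the 7 fourth graders
hours = [19, 40, 16, 12, 10, 6, 9]

total = sum(hours)          # the sum of the y_i
y_bar = total / len(hours)  # sample mean: sum divided by n

print(f"sum = {total}, n = {len(hours)}, mean = {y_bar}")
```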

Population Mean

Denoted by the Greek letter $\mu$:

$\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$

N is the population size (for example, N = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: frequency histogram of Absences from Work; bins 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; Frequency axis 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
       = mean of 2 middle values, if n is even

Ex: 2 4 6 8 10; n = 5, median = 6
Ex: 2 4 6 8; n = 4, median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
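The odd/even rule for the median can be sketched in Python as follows. The function is written out by hand to mirror the rule; the standard library's `statistics.median` follows the same convention.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2


# Student pulse rates from the slide (n = 62)
pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(median([2, 4, 6, 8, 10]))  # odd n
print(median([2, 4, 6, 8]))      # even n
print(median(pulses))            # n = 62: mean of the 31st and 32nd values
```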

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429
4960
4960
4971
5245
5546
7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429
4960
5245
5546
4971
5587
7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2 4 6 8: $\bar{x} = \dfrac{20}{4} = 5$, $m = \dfrac{4+6}{2} = 5$

Ex: 2 4 6 9: $\bar{x} = \dfrac{21}{4} = 5\tfrac{1}{4}$, $m = \dfrac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location = 12th obs., m = 85

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
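As an illustration of this sensitivity, the following Python sketch reuses the 23 class pulse rates from the earlier example and drops the largest value (140). Removing a data value is done here only to show the effect on the two summaries; it is not something the slides do.

```python
from statistics import mean, median

# Class pulse rates (n = 23); the last value, 140, is far from the rest
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

mean_all, median_all = mean(pulses), median(pulses)

# Same data without the single extreme observation
trimmed = pulses[:-1]
mean_trimmed, median_trimmed = mean(trimmed), median(trimmed)

print(f"with 140:    mean = {mean_all:.2f}, median = {median_all}")
print(f"without 140: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

One observation moves the mean by about 2.5 beats while the median shifts by only half a beat.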

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; x-axis Year (1985 to 2013); left axis Mean, Median Salary (200,000 to 3,700,000); right axis Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; x-axis Salary ($1000s); y-axis Frequency (0 to 600); bar frequencies 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis Exam Scores (20 to 100); y-axis Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; x-axis Number of Customers; y-axis Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$
$y_i - \bar{y}$: deviation of $y_i$ from the mean
$\sum_{i=1}^{n}(y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$
$(x_1-\bar{x}) + (x_2-\bar{x}) = (49-50) + (51-50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$
$(y_1-\bar{y}) + (y_2-\bar{y}) = (0-50) + (100-50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations

sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}$

$s^2 = \dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}$ is called the sample variance

Calculations …

Women height (inches)

 i    x_i    xbar    (x_i - xbar)    (x_i - xbar)^2
 1    59     63.4    -4.4            19.0
 2    60     63.4    -3.4            11.3
 3    61     63.4    -2.4             5.6
 4    62     63.4    -1.4             1.8
 5    62     63.4    -1.4             1.8
 6    63     63.4    -0.4             0.1
 7    63     63.4    -0.4             0.1
 8    63     63.4    -0.4             0.1
 9    64     63.4     0.6             0.4
10    64     63.4     0.6             0.4
11    65     63.4     1.6             2.7
12    66     63.4     2.6             7.0
13    67     63.4     3.6            13.3
14    68     63.4     4.6            21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation s: $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$

[Figure: mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
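As one example of the "other software" route, the following Python sketch reproduces the women's height calculation step by step and then compares it with the standard library's `statistics.stdev`. Variable names are illustrative. The code works from the unrounded mean, so intermediate values can differ slightly from the table, which rounds the mean to 63.4.

```python
from math import sqrt
from statistics import stdev

# Women's heights in inches (n = 14)
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n

deviations = [x - x_bar for x in heights]
sum_sq = sum(d ** 2 for d in deviations)   # sum of squared deviations

variance = sum_sq / (n - 1)   # divide by degrees of freedom, n - 1
s = sqrt(variance)            # standard deviation

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {variance:.2f} square inches, s = {s:.2f} inches")
print(f"statistics.stdev gives {stdev(heights):.2f}")
```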

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(y_i-\mu)^2}{N}}$

N is the population size (for example, N = 34,000 for NCSU)

$\mu$ is the population mean

the value of $\sigma$ is typically not known; use s to estimate the value of $\sigma$
Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Obs   Count   Value
  1     1      0.6
  2     2      1.2
  3     3      1.6
  4     4      1.9
  5     5      1.5
  6     6      2.1
  7     7      2.3   (Q1)
  8     6      2.3
  9     5      2.5
 10     4      2.8
 11     3      2.9
 12     2      3.3
 13     1      3.4   (median)
 14     2      3.6
 15     3      3.7
 16     4      3.8
 17     5      3.9
 18     6      4.1
 19     7      4.2   (Q3)
 20     6      4.5
 21     5      4.7
 22     4      4.9
 23     3      5.3
 24     2      5.6
 25     1      6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1      M      Q3
 1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
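The quartile rules above can be sketched in code. This is a minimal illustration of the rule as stated on the slide; the function name `quartiles` is hypothetical, and software packages may use slightly different conventions and give slightly different quartiles.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: include the overall median in both halves
        lower = ordered[: half + 1]
        upper = ordered[half:]
    else:
        # n even: the overall median is in neither half
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the n = 10 example on the slide this reproduces Q1 = 6, m = 11, Q3 = 16.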

Pulse Rates, n = 138

Count   Stem   Leaves
          4
   3      4    588
   9      5    001233444
  10      5    5556788899
  23      6    00011111122233333344444
  23      6    55556666667777788888888
  16      7    0000011222233444
  23      7    55555666666777888888999
  10      8    0000112224
  10      8    5555667789
   4      9    0012
   2      9    58
   4     10    0223
         10
   1     11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem   Leaf
   2     22    55
   4     23    57
   6     24    26
   7     25    7
  10     26    257
  12     27    59
  (4)    28    1567
  15     29    35599
  10     30    333
   7     31    45
   5     32    155
   2     33    6
   1     34    0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem   Leaf
   2     22    55
   4     23    57
   6     24    26
   7     25    7
  10     26    257
  12     27    59
  (4)    28    1567
  15     29    35599
  10     30    333
   7     31    45
   5     32    155
   2     33    6
   1     34    0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data, observation 25 down to observation 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Five-number summary: min, Q1, m, Q3, max

[Figure: BOXPLOT, boxplot display of the 5-number summary for Disease X; vertical axis: Years until death, 0 to 7]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data, observation 25 down to observation 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: BOXPLOT, boxplot display of the 5-number summary for Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
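The 1.5 IQR outlier check can be sketched as follows. The function name `outlier_fences` is hypothetical, and the data are assumed to be the 25 Disease X values with the largest observation changed to 7.9, as on the slide.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Disease X, years until death (observation 25 is 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = outlier_fences(2.3, 4.2)  # quartiles from the slide

# Values beyond the fences are flagged as outliers.
outliers = [y for y in years if y < low or y > high]

# The upper whisker stops at the biggest value below the upper fence.
whisker_top = max(y for y in years if y < high)

print("fences:", round(low, 2), round(high, 2))
print("outliers:", outliers, "upper whisker ends at:", whisker_top)
```

The same function applied to the class pulse rates, `outlier_fences(63, 78)`, gives the fences 40.5 and 100.5.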

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with labels 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival              Crew   First   Second   Third   Total
Alive    Count         212     202      118     178     710
         % of col     24.0    62.2     41.4    25.2    32.3
Dead     Count         673     123      167     528    1491
         % of col     76.0    37.8     58.6    74.8    67.7
Total    Count         885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                        Class
Survival              Crew   First   Second   Third   Total
Alive    Count         212     202      118     178     710
         % of col     24.0    62.2     41.4    25.2    32.3
Dead     Count         673     123      167     528    1491
         % of col     76.0    37.8     58.6    74.8    67.7
Total    Count         885     325      285     706    2201

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325
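The marginal and conditional distributions for the Titanic table can be computed directly from the counts. The sketch below assumes the counts shown on the slides; the variable names are hypothetical.

```python
# Counts from the Titanic contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: totals divided by the grand total.
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by column totals.
survival_given_class = {
    c: {s: counts[s][c] / col_totals[c] for s in counts} for c in classes
}

# The three questions on the slide use three different denominators.
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
first_and_survivor = counts["Alive"]["First"] / grand_total             # 202/2201
survived_given_first = survival_given_class["First"]["Alive"]           # 202/325

print(round(100 * survived_given_first, 1), "% of first class survived")
```

The point of the three questions is that the same count, 202, answers different questions depending on which total it is divided by.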

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Table: Student, Beers, BAC for the same 16 students as on the preceding slide]

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$

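Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$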
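The formula for r can be applied directly to the car weight and fuel consumption data. This is a minimal sketch of that calculation; the function name `correlation` is hypothetical, and the ten (x, y) pairs are assumed to be those listed on the scatterplot slide.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is the product of an x z-score and a y z-score, which is why r has no units and does not change when the units of x or y change.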

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Notation for Data Values and Sample Mean

The sample size is denoted by $n$.

For a variable denoted by $y$, its observations are denoted by $y_1, y_2, y_3, \ldots, y_n$.

A common measure of center is the sample mean.

The sample mean is denoted by $\bar{y}$:

$\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Shortened expression for $\bar{y}$ using the symbol $\Sigma$ (uppercase Greek letter sigma):

$\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n, \qquad \bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \dfrac{\sum_{i=1}^{7} y_i}{7} = \dfrac{112}{7} = 16$
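The same calculation in code, as a minimal sketch (the name `sample_mean` is hypothetical; the data are the seven viewing times from the slide):

```python
def sample_mean(values):
    """Sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

tv_hours = [19, 40, 16, 12, 10, 6, 9]  # weekly TV hours, 7 fourth graders

total = sum(tv_hours)
y_bar = sample_mean(tv_hours)
print(total, y_bar)
```

Python's standard library also provides `statistics.mean`, which gives the same result for these data.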

Population Mean

Denoted by the Greek letter $\mu$:

$\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$

$N$ is the population size (for example, $N$ = 34000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: Histogram of Absences from Work; horizontal axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude,

Median = middle value, if n is odd
Median = mean of 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75 + 76)/2 = 75.5
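The odd/even rule for the median can be sketched as below. The function name `sample_median` is hypothetical; the data are the 62 student pulse rates listed on the slide.

```python
def sample_median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(len(pulses), sample_median(pulses))
```

With n = 62 the median is the mean of the values in ordered positions 31 and 32.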

The median splits the histogram into 2 halves of equal area

Mean: balance point
Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \dfrac{20}{4} = 5$, $m = \dfrac{4 + 6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \dfrac{21}{4} = 5\tfrac{1}{4}$, $m = \dfrac{4 + 6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location is the 12th obs.; $m = 85$

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014; horizontal axis: Year, 1985 to 2013; left axis: Mean, Median Salary, 200,000 to 3,700,000; right axis: Maximum Salary, 0 to 35,000,000; series: Mean, Median, Maximum]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600; bar counts: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3: Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women's height (inches)

 i    x_i    xbar    (x_i - xbar)    (x_i - xbar)^2
 1    59     63.4       -4.4            19.0
 2    60     63.4       -3.4            11.3
 3    61     63.4       -2.4             5.6
 4    62     63.4       -1.4             1.8
 5    62     63.4       -1.4             1.8
 6    63     63.4       -0.4             0.1
 7    63     63.4       -0.4             0.1
 8    63     63.4       -0.4             0.1
 9    64     63.4        0.6             0.4
10    64     63.4        0.6             0.4
11    65     63.4        1.6             2.7
12    66     63.4        2.6             7.0
13    67     63.4        3.6            13.3
14    68     63.4        4.6            21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

1. First calculate the variance $s^2$:

$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \dfrac{85.2}{13} = 6.55$ square inches

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{6.55} = 2.56$ inches

[Figure: heights with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
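As one way to get these values from software, the sketch below follows the two steps on the slide (variance first, then square root) for the 14 heights. The function name `sample_variance` is hypothetical; `statistics.stdev` from Python's standard library is used only as a cross-check.

```python
from math import sqrt
from statistics import stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

variance = sample_variance(heights)  # step 1: variance, in square inches
sd = sqrt(variance)                  # step 2: standard deviation, in inches

print(f"s^2 = {variance:.2f}, s = {sd:.2f}")
print("library check:", round(stdev(heights), 2))
```

The small difference between this result and the table's column sum comes from the table rounding the mean to 63.4 before squaring.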

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$

$N$ is the population size (for example, $N$ = 34000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ = sample variance; $\sigma^2$ = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) and Histogram (graphical), linked by the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

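68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]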

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

Percentage of data values in this interval: $\frac{32}{50} = 64\%$; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

Percentage of data values in this interval: $\frac{48}{50} = 96\%$; 68-95-99.7 rule: 95%

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team What is the

value of the first quartile Q1

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 287

2 2575

3 2635

4 2625

Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Disease X: years until death

 i   pos. in half   years
25        1          7.9
24        2          6.1
23        3          5.3
22        4          4.9
21        5          4.7
20        6          4.5
19        7          4.2
18        6          4.1
17        5          3.9
16        4          3.8
15        3          3.7
14        2          3.6
13        1          3.4
12        2          3.3
11        3          2.9
10        4          2.8
 9        5          2.5
 8        6          2.3
 7        7          2.3
 6        6          2.1
 5        5          1.5
 4        4          1.9
 3        3          1.6
 2        2          1.2
 1        1          0.6

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
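The outlier check on this slide can be sketched in a few lines of Python. This is only an illustration: the data and quartiles are taken from the slide, the variable names are made up for the example, and the lower fence (Q1 - 1.5 IQR) is added by analogy with the pulse-rate slide.

```python
# Disease X survival times (years) as listed on the slide; the value 1.5
# appears out of order in the source and is kept as given.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2                 # quartiles as reported on the slide
iqr = q3 - q1                     # interquartile range
upper_fence = q3 + 1.5 * iqr      # values above this are flagged
lower_fence = q1 - 1.5 * iqr      # values below this are flagged

outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(outliers, whisker_top)
```

Any value flagged here would be drawn as a separate point on the boxplot, and the whisker ends at `whisker_top`.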

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with marks at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive    710/2201 = 32.3%
Dead    1491/2201 = 67.7%

marg. dist. of class:
Crew     885/2201 = 40.2%
First    325/2201 = 14.8%
Second   285/2201 = 12.9%
Third    706/2201 = 32.1%
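The marginal distributions above are just row and column totals divided by the grand total. A minimal Python sketch, using the Titanic counts from the table (the dictionary layout and variable names are illustrative choices, not part of the slides):

```python
# Titanic counts from the contingency table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

# Grand total of all cells.
total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {
    status: sum(row.values()) / total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total.
classes = list(counts["Alive"])
class_marginal = {
    c: sum(row[c] for row in counts.values()) / total for c in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```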

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                             Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col    24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col    76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                             Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col    24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col    76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201
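The three questions use the same count (202 first-class survivors) with three different denominators. A short Python sketch of the arithmetic, with the counts taken from the table and all variable names chosen only for this illustration:

```python
# Titanic counts by class, from the contingency table.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())                 # all survivors
everyone = survivors + sum(dead.values())       # everyone on board
first_class = alive["First"] + dead["First"]    # all first-class passengers

# Same numerator, different denominators, different questions.
frac_survivors_in_first = alive["First"] / survivors    # among survivors
frac_first_and_survived = alive["First"] / everyone     # among everyone
frac_first_who_survived = alive["First"] / first_class  # among first class

# Conditional distribution of survival given class (the "% of col" rows).
pct_survived_by_class = {
    c: 100 * alive[c] / (alive[c] + dead[c]) for c in alive
}

print(round(frac_survivors_in_first, 3),
      round(frac_first_and_survived, 3),
      round(frac_first_who_survived, 3))
```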

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 8.0%
2) 23.5%
3) 58.2%
4) 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 41.8%
2) 38.8%
3) 51.2%
4) 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%
2) 48.8%
3) 26.8%
4) 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
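As a rough illustration of the formula for r, the following Python sketch standardizes each x and y, multiplies the pairs, and divides the sum by n - 1. The data are the car weight and fuel consumption pairs from the scatterplot slide; the function name `correlation` is just a label chosen for this example.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r computed from standardized values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized values, divided by n - 1.
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))
```

The result should agree, up to rounding, with the value of r reported for the fuel consumption data.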

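Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$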

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3


Simple Example of Sample Mean

Weekly TV viewing time in hours of 7 randomly selected 4th graders:

19, 40, 16, 12, 10, 6, and 9

$\sum_{i=1}^{7} y_i = 19 + 40 + 16 + 12 + 10 + 6 + 9 = 112$

$\bar{y} = \frac{\sum_{i=1}^{7} y_i}{7} = \frac{112}{7} = 16$

Population Mean

Denoted by the Greek letter $\mu$:

$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$ = population mean

N is the population size (for example, N = 34,000 for NCSU)

the value of $\mu$ is typically not known

we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean xbar = 140.6

[Figure: histogram "Absences from Work"; horizontal axis bins: 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5, More; vertical axis: Frequency, 0 to 70]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)

Median = mean of 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% area each half

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example: n = 7
175 28 32 139 141 253 458

Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458

Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 4960, 4971, 5245, 5546, 7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429, 4960, 5245, 5546, 4971, 5587, 7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$; $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location = 12th obs.; m = 85
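The mean and median for the class pulse rates can be checked with Python's standard `statistics` module. The sketch below also drops the largest value (140) purely as an illustration, not something done on the slide, to show how a single extreme observation moves the mean more than the median.

```python
import statistics

# Class pulse rates from the slide (already sorted, n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89, 90,
          90, 90, 90, 91, 96, 98, 103, 140]

mean = statistics.mean(pulses)      # sum of values / n
median = statistics.median(pulses)  # 12th ordered value when n = 23

# Illustration only: remove the largest observation and recompute.
without_max = pulses[:-1]
mean_without_max = statistics.mean(without_max)
median_without_max = statistics.median(without_max)

print(round(mean, 2), median)
print(round(mean_without_max, 2), median_without_max)
```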

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum, 1985-2014"; horizontal axis: Year, 1985 to 2013; left axis: Mean, Median Salary, $200,000 to $3,700,000; right axis: Maximum Salary, $0 to $35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

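Example

Data set 1: $x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$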
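Why the deviations from the mean always sum to zero follows directly from the definition of the sample mean, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. A short worked derivation, added here as a supplement to the two numerical examples:

```latex
\sum_{i=1}^{n} (y_i - \bar{y})
  = \sum_{i=1}^{n} y_i - n\bar{y}
  = n\bar{y} - n\bar{y}
  = 0
```

Both data sets above have the same sum of deviations (zero) even though their spreads differ greatly, which is why the deviations are squared before being averaged.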

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ = sample standard deviation

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations …

Women height (inches)

  i    xi    xbar   (xi - xbar)   (xi - xbar)^2
  1    59    63.4      -4.4           19.0
  2    60    63.4      -3.4           11.3
  3    61    63.4      -2.4            5.6
  4    62    63.4      -1.4            1.8
  5    62    63.4      -1.4            1.8
  6    63    63.4      -0.4            0.1
  7    63    63.4      -0.4            0.1
  8    63    63.4      -0.4            0.1
  9    64    63.4       0.6            0.4
 10    64    63.4       0.6            0.4
 11    65    63.4       1.6            2.7
 12    66    63.4       2.6            7.0
 13    67    63.4       3.6           13.3
 14    68    63.4       4.6           21.6
     Mean 63.4        Sum 0.0       Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

1. First calculate the variance s^2.
2. Then take the square root to get the standard deviation s.

[Figure: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
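One way to automate the calculation is sketched below in Python, using the 14 women's heights from the table. It follows the two steps on the slide (variance first, then its square root) and cross-checks against `statistics.stdev` from the standard library; the variable names are illustrative only.

```python
import statistics

# Women's heights (inches) from the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Squared deviation of each observation from the mean.
squared_devs = [(x - mean) ** 2 for x in heights]

# Step 1: variance = sum of squared deviations / (n - 1).
variance = sum(squared_devs) / (n - 1)

# Step 2: standard deviation = square root of the variance.
sd = variance ** 0.5

# The library routine uses the same n - 1 divisor.
sd_library = statistics.stdev(heights)

print(round(mean, 1), round(variance, 2), round(sd, 2))
```

The table rounds each entry to one decimal place, so the hand-computed column values can differ slightly from the unrounded ones used here, while the final results agree to two decimals.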

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ = population standard deviation

N is the population size (for example, N = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known

use s to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s^2 = sample variance; σ^2 = population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0 (when does s = 0? σ = 0?)

The larger the value of s (or σ), the greater the spread of the data

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
m   sample median
s^2   sample variance
s   sample stand. dev.

POPULATION
μ   population mean
    population median
σ^2   population variance
σ   population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical); Histogram (graphical); 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

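68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between ybar - s and ybar + s, 34% on each side of ybar]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between ybar - 2s and ybar + 2s, 47.5% on each side of ybar]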

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
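The interval counts for the textbook cost data can be reproduced with a few lines of Python. This sketch takes the mean and standard deviation as reported on the slide rather than recomputing them; the helper name `share_within` is made up for the example.

```python
# Textbook costs for the 50 students, from the slide.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72   # mean and standard deviation as reported

def share_within(k):
    """Count and percentage of costs within k standard deviations of the mean."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(1 for c in costs if low < c < high)
    return inside, 100 * inside / len(costs)

for k, rule in [(1, 68), (2, 95), (3, 99.7)]:
    count, pct = share_within(k)
    print(f"within {k} s.d.: {count}/50 = {pct:.0f}% (rule: {rule}%)")
```

The observed percentages are close to, but not exactly, the 68-95-99.7 values, which is expected for a sample of 50 that is only approximately bell-shaped.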

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where
y = original data value
$\bar{y}$ = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
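The z-score comparisons above reduce to one small function. The Python sketch below uses the exam and SAT/ACT numbers from the slides; the function name `z_score` and the variable names are illustrative.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam example: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales made comparable.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)
print(z_eleanor, z_gerald)
```

Because z-scores are unit-free, the larger z-score identifies the better relative performance even when the raw scores are on different scales.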

Z-scores add to zero

Student/Institutional Support to Athletic Depts For the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar), Sum = 0 (Z-score)

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 i   pos. in half   years
 1        1          0.6
 2        2          1.2
 3        3          1.6
 4        4          1.9
 5        5          1.5
 6        6          2.1
 7        7          2.3
 8        6          2.3
 9        5          2.5
10        4          2.8
11        3          2.9
12        2          3.3
13        1          3.4
14        2          3.6
15        3          3.7
16        4          3.8
17        5          3.9
18        6          4.1
19        7          4.2
20        6          4.5
21        5          4.7
22        4          4.9
23        3          5.3
24        2          5.6
25        1          6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
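The quartile rule described above (median of each half, with the overall median included in both halves when n is odd) can be sketched in Python as follows. The function names are chosen for this illustration, and the example data are the ten values from the slide.

```python
def median(sorted_vals):
    """Median of an already sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]                        # middle value
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2  # mean of 2 middle values

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
iqr = q3 - q1
print(q1, m, q3, iqr)
```

Software packages use several different quartile conventions, so a library routine may return slightly different values than this rule for the same data.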


>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 3.1 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 3.1 continued: Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 3.2 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 3.3 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations …
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review: Properties of s and σ
  • Summary of Notation
  • Section 3.3 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-99.7 rule
  • The 68-95-99.7 rule If the histogram of the data is approximat
  • 68-95-99.7 rule: 68% within 1 stan dev of the mean
  • 68-95-99.7 rule: 95% within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 3.3 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 3.4 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 3.5 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 3.5 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Population Mean

Denoted by the Greek letter $\mu$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$ (population mean)

The value of $\mu$ is typically not known; we often use the sample mean $\bar{y}$ to estimate the unknown value of $\mu$.

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x} = 140.6$

[Figure: histogram; x-axis: Absences from Work (bins from 118.5 to 160.5, then "More"); y-axis: Frequency]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = the middle value, if n is odd

Median = the mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6

Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
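The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `median` is chosen here and is not part of the slides.

```python
def median(values):
    """Middle value if n is odd, mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # n odd: one middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```

Python's standard library offers the same calculation as `statistics.median`.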

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean = 55.26 years, median = 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location = 12th obs., $m = 85$

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
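The effect can be checked on the class pulse rates listed earlier. The sketch below uses Python's `statistics` module and compares the mean and median with and without the largest value (140); the variable names are illustrative only.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

with_outlier = (mean(pulses), median(pulses))
trimmed = pulses[:-1]                       # same data without the value 140
without_outlier = (mean(trimmed), median(trimmed))

print(round(with_outlier[0], 2), with_outlier[1])        # 84.48 85
print(round(without_outlier[0], 2), without_outlier[1])  # 81.95 84.5
```

Dropping one observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.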

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum, 1985-2014"; x-axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; x-axis: Salary ($1000s); y-axis: Frequency]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; x-axis: Exam Scores (20 to 100); y-axis: Frequency]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers; y-axis: Frequency]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center

measures where the "middle" of the data is located

variability

measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

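Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$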
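The two data sets can be run through a short Python sketch to confirm that the deviations sum to zero in both cases even though the spreads differ greatly. The helper name `deviations` is illustrative and not from the slides.

```python
def deviations(values):
    """Deviation of each value from the mean of the data set."""
    center = sum(values) / len(values)
    return [v - center for v in values]

set1 = [49, 51]
set2 = [0, 100]

print(deviations(set1), sum(deviations(set1)))  # [-1.0, 1.0] 0.0
print(deviations(set2), sum(deviations(set2)))  # [-50.0, 50.0] 0.0
```

Because the sum is zero for every data set, it cannot distinguish a tight data set from a spread-out one, which is why the deviations are squared before averaging.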

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations …

Women's height (inches)

 i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1   59    63.4      -4.4           19.0
 2   60    63.4      -3.4           11.3
 3   61    63.4      -2.4            5.6
 4   62    63.4      -1.4            1.8
 5   62    63.4      -1.4            1.8
 6   63    63.4      -0.4            0.1
 7   63    63.4      -0.4            0.1
 8   63    63.4      -0.4            0.1
 9   64    63.4       0.6            0.4
10   64    63.4       0.6            0.4
11   65    63.4       1.6            2.7
12   66    63.4       2.6            7.0
13   67    63.4       3.6           13.3
14   68    63.4       4.6           21.6
     Mean 63.4     Sum 0.0        Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: the women's height data with the interval mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
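As one example of "other software", the sketch below repeats the women's height calculation in Python, first step by step and then with the standard library's `statistics.stdev`. Python is an assumption here; the slides name only calculators, Excel, and StatCrunch.

```python
from statistics import mean, stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
x_bar = mean(heights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations
s2 = sum_sq_dev / (n - 1)                            # sample variance, df = n - 1
s = s2 ** 0.5                                        # sample standard deviation

print(round(x_bar, 1), round(sum_sq_dev, 1), round(s2, 2), round(s, 2))
# 63.4 85.2 6.55 2.56
print(round(stdev(heights), 2))  # 2.56, same result from the library function
```

Note that `statistics.stdev` divides by n - 1 (sample standard deviation), while `statistics.pstdev` divides by N (population standard deviation).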

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N = 34{,}000$ for NCSU).

$\mu$ is the population mean.

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s² = sample variance; σ² = population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0.

When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

$m$: population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

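68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]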

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
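The three interval counts can be reproduced with a short Python sketch. It takes the mean and standard deviation as reported on the slide (375.48 and 42.72) rather than recomputing them; the function name `share_within` is illustrative.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR, S = 375.48, 42.72   # mean and standard deviation as given on the slide

def share_within(data, center, spread, k):
    """Count and fraction of values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    inside = sum(low < value < high for value in data)
    return inside, inside / len(data)

for k in (1, 2, 3):
    count, share = share_within(costs, Y_BAR, S, k)
    print(k, count, f"{share:.0%}")
# 1 32 64%
# 2 48 96%
# 3 50 100%
```

The observed 64%, 96%, and 100% are close to the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.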

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
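The comparison can be sketched in Python as follows; the function name `z_score` and its parameter names are illustrative, not from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

eleanor = z_score(680, mean=500, sd=100)   # SAT Math
gerald = z_score(27, mean=18, sd=6)        # ACT Math

print(eleanor, gerald)   # 1.8 1.5
print("Eleanor" if eleanor > gerald else "Gerald", "has the better score")
```

Standardizing puts scores from tests with different scales onto a common scale, which is what makes the comparison meaningful.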

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Individual   Count   Value
     1         1      0.6
     2         2      1.2
     3         3      1.6
     4         4      1.9
     5         5      1.5
     6         6      2.1
     7         7      2.3
     8         6      2.3
     9         5      2.5
    10         4      2.8
    11         3      2.9
    12         2      3.3
    13         1      3.4
    14         2      3.6
    15         3      3.7
    16         4      3.8
    17         5      3.9
    18         6      4.1
    19         7      4.2
    20         6      4.5
    21         5      4.7
    22         4      4.9
    23         3      5.3
    24         2      5.6
    25         1      6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
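The quartile rule above can be sketched in Python as follows. The function names are illustrative, and the halves are formed exactly as the rule states: the overall median is included in both halves when n is odd and in neither when n is even.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:                 # n odd: overall median goes in both halves
        lower, upper = data[:half + 1], data[half:]
    else:                          # n even: overall median is in neither half
        lower, upper = data[:half], data[half:]
    return median_sorted(lower), median_sorted(data), median_sorted(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

Software packages use several slightly different quartile conventions, so a library routine such as Python's `statistics.quantiles` may return values that differ a little from this rule.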

Pulse Rates, n = 138

Count   Stem   Leaves
          4
   3      4    588
   9      5    001233444
  10      5    5556788899
  23      6    00011111122233333344444
  23      6    55556666667777788888888
  16      7    0000011222233444
  23      7    55555666666777888888999
  10      8    0000112224
  10      8    5555667789
   4      9    0012
   2      9    58
   4     10    0223
         10
   1     11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem | Leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Values for individuals 1-25: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

[Figure: boxplot for Disease X; vertical axis: Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

[Figure: BOXPLOT]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Values for individuals 1-25: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Boxplot display of 5-number summary

[Figure: BOXPLOT for Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
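The outlier check can be sketched in Python as below. The quartiles are taken as given on the slide (Q1 = 2.3, Q3 = 4.2) rather than recomputed, and the variable names, including the term "fence" for the 1.5 IQR cutoffs, are choices made for this illustration.

```python
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

Q1, Q3 = 2.3, 4.2                 # quartiles as stated on the slide

iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr      # values below this are flagged as outliers
upper_fence = Q3 + 1.5 * iqr      # values above this are flagged as outliers

outliers = [y for y in years if y < lower_fence or y > upper_fence]
whisker_top = max(y for y in years if y <= upper_fence)  # where the upper line ends

print(round(iqr, 2), round(upper_fence, 2))  # 1.9 7.05
print(outliers, whisker_top)                 # [7.9] 6.1
```

The upper line of the boxplot therefore stops at 6.1, and 7.9 is plotted as a separate point.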

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis from 0 to 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
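The marginal distributions can be recomputed from the cell counts with the Python sketch below. The nested-dictionary layout is just one convenient way to hold the table and is an assumption of this illustration.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total
class_marginal = {cls: sum(row[cls] for row in counts.values()) / total
                  for cls in counts["Alive"]}

print(total)  # 2201
for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
# Alive: 32.3%, Dead: 67.7%, Crew: 40.2%, First: 14.8%, Second: 12.9%, Third: 32.1%
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.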

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival             Crew   First   Second   Third   Total
Alive   Count         212     202      118     178     710
        % of col     24.0    62.2     41.4    25.2    32.3
Dead    Count         673     123      167     528    1491
        % of col     76.0    37.8     58.6    74.8    67.7
Total   Count         885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                              Class
Survival             Crew   First   Second   Third   Total
Alive   Count         212     202      118     178     710
        % of col     24.0    62.2     41.4    25.2    32.3
Dead    Count         673     123      167     528    1491
        % of col     76.0    37.8     58.6    74.8    67.7
Total   Count         885     325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
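The three questions share the numerator 202 but use different denominators, which the Python sketch below makes explicit. The variable names are illustrative only.

```python
first_alive = 202     # first-class passengers who survived
alive_total = 710     # all survivors
first_total = 325     # all first-class passengers
grand_total = 2201    # everyone on board

of_survivors_in_first = first_alive / alive_total   # conditional on surviving
first_and_survived = first_alive / grand_total      # joint: first class AND survived
of_first_who_survived = first_alive / first_total   # conditional on class

print(f"{of_survivors_in_first:.1%}")   # 28.5%
print(f"{first_and_survived:.1%}")      # 9.2%
print(f"{of_first_who_survived:.1%}")   # 62.2%
```

The last value matches the 62.2 shown in the "% of col" row for first class, since column percentages are conditional on class.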

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient $r$ is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs); y-axis: FUEL CONSUMP. (gal/100 miles)]

$r = .9766$

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
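The formula can be applied to the ten (weight, fuel consumption) pairs with the Python sketch below. The function name `correlation` is illustrative; the standard deviations are the sample versions (dividing by n - 1), as in the formula.

```python
from statistics import mean, stdev

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

print(round(correlation(weight, fuel), 4))  # 0.9766
```

Each term in the sum is a product of two z-scores, so r does not depend on the units in which weight and fuel consumption are measured.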

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; x-axis: Fouls (0 to 30); y-axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; y-axis: Frequency]

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = the middle value, if n is odd
       = the mean of the 2 middle values, if n is even

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5
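The median rule above can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the variable `pulse_rates` are chosen here and do not come from the slides.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    data = sorted(values)          # arrange in order of magnitude
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


# Student pulse rates from the slide (n = 62)
pulse_rates = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
    70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
    77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
    90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103,
]

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
print(median(pulse_rates))       # 75.5
```

Python's standard library offers `statistics.median`, which follows the same rule.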

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean = 55.26 years; median = 57.7 years

Medians are used often

Year 2011 baseball salaries:

Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011, $166,500; May 2010, $174,600

Median household income (2008 dollars): 2009, $50,221; 2008, $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location = 12th observation; m = 85

2010, 2014 baseball salaries

2010:
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014:
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
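As a rough illustration of this sensitivity, the sketch below recomputes the mean and median of the class pulse rates from the earlier example with and without the largest value (140). The variable names are chosen here for illustration only.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23)
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single largest observation (140) dropped
without_largest = pulses[:-1]

print(round(mean(pulses), 2), median(pulses))                    # 84.48 85
print(round(mean(without_largest), 2), median(without_largest))  # 81.95 84.5
```

Dropping one observation moves the mean by about 2.5 beats per minute, while the median moves by only 0.5.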

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; x-axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram of 2011 Baseball Salaries; x-axis: Salary ($1000s); y-axis: Frequency]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis: Exam Scores (20 to 100); y-axis: Frequency]

Symmetric data

Mean and median approximately equal

[Figure: histogram of Bank Customers, 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing.

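Example

Sum of deviations from the mean:

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$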

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ is the sample standard deviation

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

  i    x_i    mean    (x_i - mean)    (x_i - mean)²
  1    59     63.4       -4.4            19.0
  2    60     63.4       -3.4            11.3
  3    61     63.4       -2.4             5.6
  4    62     63.4       -1.4             1.8
  5    62     63.4       -1.4             1.8
  6    63     63.4       -0.4             0.1
  7    63     63.4       -0.4             0.1
  8    63     63.4       -0.4             0.1
  9    64     63.4        0.6             0.4
 10    64     63.4        0.6             0.4
 11    65     63.4        1.6             2.7
 12    66     63.4        2.6             7.0
 13    67     63.4        3.6            13.3
 14    68     63.4        4.6            21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance s².

2. Then take the square root to get the standard deviation s:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure label: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
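As one example of "other software", the sketch below reproduces the women's height calculation with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor. The slides name only a calculator, Excel, and StatCrunch; Python is an assumption made here for illustration.

```python
from statistics import mean, variance, stdev

# Women's heights (inches) from the calculation table, n = 14
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

print(round(mean(heights), 1))      # 63.4   sample mean
print(round(variance(heights), 2))  # 6.55   sample variance s^2 (divides by n - 1)
print(round(stdev(heights), 2))     # 2.56   sample standard deviation s
```

The population versions, which divide by N instead, are `statistics.pvariance` and `statistics.pstdev`.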

Population Standard Deviation

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

Denoted by the lower case Greek letter $\sigma$

N is the population size (for example, N = 34,000 for NCSU)

$\mu$ is the population mean

The value of $\sigma$ is typically not known; use s to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and $\sigma$ are always greater than or equal to zero.

3. The larger the value of s (or $\sigma$), the greater the spread of the data.

When does s = 0? When does $\sigma$ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and $\sigma$

s and $\sigma$ are always greater than or equal to 0 (when does s = 0? $\sigma$ = 0?)

The larger the value of s (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION

$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical), linked to Histogram (graphical) by the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

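68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]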

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean:

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean:

$\bar{y} = 375.48$, $s = 42.72$

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
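The three interval counts in the textbook-cost example can be checked with a short script. This is a minimal sketch that takes the mean and standard deviation as stated on the slide rather than recomputing them; the helper name `count_within` is hypothetical.

```python
# Textbook costs from the example (n = 50)
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

Y_BAR = 375.48  # mean, as stated on the slide
S = 42.72       # standard deviation, as stated on the slide


def count_within(data, k):
    """Number of values inside (mean - k*s, mean + k*s)."""
    low, high = Y_BAR - k * S, Y_BAR + k * S
    return sum(low < y < high for y in data)


for k in (1, 2, 3):
    inside = count_within(costs, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
# 1 32 64%
# 2 48 96%
# 3 50 100%
```

The observed 64%, 96%, and 100% are close to the 68%, 95%, and 99.7% that the rule suggests for bell-shaped data.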

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
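The exam and SAT/ACT comparisons both come down to the same one-line formula. A minimal sketch follows; the function name `z_score` is chosen here for illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: same mean, different spreads
print(z_score(91, 88, 6))    # 0.5
print(z_score(92, 88, 10))   # 0.4

# SAT vs ACT example
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)
print(round(eleanor, 2), round(gerald, 2))  # 1.8 1.5
print("Eleanor" if eleanor > gerald else "Gerald")  # Eleanor
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.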

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000; s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
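The table can be reproduced in a few lines. This sketch assumes the support figures are in $ millions with one decimal place, as the stated mean of 9.1 implies; the variable names are illustrative.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools, in table order
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = mean(support)
s = stdev(support)                       # sample standard deviation (n - 1 divisor)
z = [(y - y_bar) / s for y in support]   # one z-score per school

print(round(y_bar, 4), round(s, 4))      # 9.1 3.5697
print([round(v, 2) for v in z])
print(abs(sum(z)) < 1e-9)                # True: z-scores add to zero (up to rounding)
```

The z-scores sum to zero because the deviations from the mean always sum to zero, and dividing each one by the same s does not change that.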

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

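m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (n = 25; columns: overall position, position counted within each half, value):

  1   1   0.6
  2   2   1.2
  3   3   1.6
  4   4   1.9
  5   5   1.5
  6   6   2.1
  7   7   2.3
  8   6   2.3
  9   5   2.5
 10   4   2.8
 11   3   2.9
 12   2   3.3
 13   1   3.4
 14   2   3.6
 15   3   3.7
 16   4   3.8
 17   5   3.9
 18   6   4.1
 19   7   4.2
 20   6   4.5
 21   5   4.7
 22   4   4.9
 23   3   5.3
 24   2   5.6
 25   1   6.1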

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Diagram: Q1, M, and Q3 split the ordered data into four pieces, each holding 1/4 of the data]

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
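The quartile rules above can be sketched as follows. Software packages use several different quartile conventions, so a built-in routine may return slightly different values; this sketch follows the rule stated in these slides (include the overall median in both halves when n is odd). The function names are hypothetical.

```python
def median(sorted_data):
    """Median of a list that is already sorted."""
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 1:
        return sorted_data[mid]
    return (sorted_data[mid - 1] + sorted_data[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

For the n = 10 example, the function returns Q1 = 6, m = 11, and Q3 = 16, matching the hand calculation.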

Pulse Rates, n = 138

Count   Stem | Leaves
          4  |
   3      4  | 588
   9      5  | 001233444
  10      5  | 5556788899
  23      6  | 00011111122233333344444
  23      6  | 55556666667777788888888
  16      7  | 00000112222334444
  23      7  | 55555666666777888888999
  10      8  | 0000112224
  10      8  | 5555667789
   4      9  | 0012
   2      9  | 58
   4     10  | 0223
         10  |
   1     11  | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

        stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR):

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

        stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data, largest to smallest (n = 25): 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot of Disease X; y-axis: Years until death (0 to 7)]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data, largest to smallest (n = 25): 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot display of 5-number summary

[Figure: boxplot of Disease X; y-axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses marked at 40.5, 45, 63, 70, 78, and 100.5]
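The 1.5 × IQR outlier check used in the last two examples can be sketched as a small helper. The function names `fences` and `outliers` are chosen here for illustration.

```python
def fences(q1, q3):
    """Lower and upper outlier fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def outliers(data, q1, q3):
    """Values that fall outside the fences."""
    low, high = fences(q1, q3)
    return [y for y in data if y < low or y > high]


# Beginning-of-class pulses: Q1 = 63, Q3 = 78
print(fences(63, 78))  # (40.5, 100.5)

# Disease X example with the largest value changed to 7.9
disease_x = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
             3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6]
print(outliers(disease_x, 2.3, 4.2))  # [7.9]
```

In a boxplot drawn this way, each whisker stops at the most extreme data value that is still inside the fences.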

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
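The marginal and conditional distributions for the Titanic table can be recomputed from the raw counts. This is a minimal sketch using plain dictionaries; the variable names and the `pct` helper are illustrative, not from the slides.

```python
# Counts from the Titanic contingency table
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]


def pct(p):
    """Proportion expressed as a percentage, rounded to 1 decimal place."""
    return round(100 * p, 1)


total = sum(sum(row.values()) for row in counts.values())          # 2201
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}

# Marginal distributions: row totals and column totals over the grand total
marginal_survival = {s: pct(sum(row.values()) / total) for s, row in counts.items()}
marginal_class = {c: pct(class_totals[c] / total) for c in classes}

# Conditional distribution of survival given class: count over column total
alive_given_class = {c: pct(counts["Alive"][c] / class_totals[c]) for c in classes}

print(marginal_survival)   # {'Alive': 32.3, 'Dead': 67.7}
print(marginal_class)      # {'Crew': 40.2, 'First': 14.8, 'Second': 12.9, 'Third': 32.1}
print(alive_given_class)   # {'Crew': 24.0, 'First': 62.2, 'Second': 41.4, 'Third': 25.2}

# The three questions differ only in the denominator
first_alive = counts["Alive"]["First"]
print(pct(first_alive / 710))    # share of survivors who were in first class
print(pct(first_alive / total))  # share of all passengers: first class and survived
print(pct(first_alive / 325))    # share of first class passengers who survived
```

Choosing the denominator (row total, grand total, or column total) is what distinguishes the three questions.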

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

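Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$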
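The formula for r can be applied directly to the ten (weight, fuel consumption) pairs. This sketch assumes the data values carry one decimal place (weights in 1000 lbs, fuel in gal/100 miles), which is consistent with the axis ranges and with the reported r; the function name `correlation` is chosen here.

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

Each term in the sum is a product of two z-scores, so r does not depend on the units in which weight or fuel consumption is measured.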

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

The median: another measure of center

Given a set of n data values arranged in order of magnitude:

Median = middle value (n odd)
Median = mean of the 2 middle values (n even)

Ex: 2, 4, 6, 8, 10; n = 5; median = 6
Ex: 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5
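The odd/even rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `median` is chosen here for convenience.

```python
def median(values):
    """Median by the slide's rule: middle value (n odd) or
    mean of the 2 middle values (n even)."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # n odd: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even


print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```

Python's standard library offers `statistics.median`, which follows the same rule.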

Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)/2 = 75.5

The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean = 55.26 years; median = 57.7 years

Medians are used often

Year 2011 baseball salaries: Median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 17.5 2.8 3.2 13.9 14.1 25.3 45.8
Example, n = 7 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 45.8
m = 14.1

Example, n = 8: 17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
Example, n = 8 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8
m = (14.1+17.5)/2 = 15.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: \( \bar{x} = \frac{20}{4} = 5 \), \( m = \frac{4+6}{2} = 5 \)

Ex: 2, 4, 6, 9: \( \bar{x} = \frac{21}{4} = 5\tfrac{1}{4} \), \( m = \frac{4+6}{2} = 5 \)

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

\[ \bar{x} = \frac{1}{23}\sum_{i=1}^{23} x_i = 84.48 \]

m: location: 12th obs.; m = 85
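As a quick check of the class pulse rate example, the sketch below recomputes the mean and median with Python's standard `statistics` module. The "what-if" step that drops the largest value (140) is an illustrative assumption added here, not something shown on the slide; it hints at why the mean is more sensitive to extreme values than the median.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

print(round(mean(pulses), 2))   # 84.48
print(median(pulses))           # 85 (the 12th ordered observation)

# Hypothetical what-if (not from the slides): remove the largest pulse, 140.
trimmed = pulses[:-1]
print(round(mean(trimmed), 2))  # 81.95
print(median(trimmed))          # 84.5
```

Removing one extreme value moves the mean by about 2.5 beats but the median by only 0.5.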

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Chart: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year (1985 to 2013). Left axis: Mean/Median Salary ($200,000 to $3,700,000). Right axis: Maximum Salary ($0 to $35,000,000).]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries. Horizontal axis: Salary ($1000s). Vertical axis: Frequency (0 to 600).]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100). Vertical axis: Frequency (0 to 30).]

Symmetric data

mean, median approx. equal

[Histogram: Bank Customers, 10:00-11:00 am. Horizontal axis: Number of Customers. Vertical axis: Frequency (0 to 20).]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general, too crude; sensitive to one large or small obs.

2. Measure spread from the middle, where the middle is the mean \( \bar{y} \):
   \( y_i - \bar{y} \): deviation of \( y_i \) from the mean
   \( \sum_{i=1}^{n} (y_i - \bar{y}) \): sum the deviations of all the \( y_i \)'s from \( \bar{y} \)

\[ \sum_{i=1}^{n} (y_i - \bar{y}) = 0 \quad \text{always; tells us nothing} \]

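Example: sum of deviations from mean

Data set 1: \( x_1 = 49,\ x_2 = 51,\ \bar{x} = 50 \)
\[ (x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0 \]

Data set 2: \( y_1 = 0,\ y_2 = 100,\ \bar{y} = 50 \)
\[ (y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0 \]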

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation} \]

\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance} \]

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

\( s^2 \) = variance = 85.2/13 = 6.55 square inches

s = standard deviation = \( \sqrt{6.55} \) = 2.56 inches

Women height (inches)

  i    x_i    x̄      (x_i - x̄)   (x_i - x̄)^2
  1    59    63.4     -4.4        19.0
  2    60    63.4     -3.4        11.3
  3    61    63.4     -2.4         5.6
  4    62    63.4     -1.4         1.8
  5    62    63.4     -1.4         1.8
  6    63    63.4     -0.4         0.1
  7    63    63.4     -0.4         0.1
  8    63    63.4     -0.4         0.1
  9    64    63.4      0.6         0.4
 10    64    63.4      0.6         0.4
 11    65    63.4      1.6         2.7
 12    66    63.4      2.6         7.0
 13    67    63.4      3.6        13.3
 14    68    63.4      4.6        21.6

Mean: 63.4     Sum of deviations: 0.0     Sum of squared deviations: 85.2
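The slides recommend using software rather than hand calculation. As one possible illustration (an assumption of this rewrite, since the slides name only calculators, Excel, and other software), the following Python sketch reproduces the height calculation step by step, dividing by n - 1 as in the sample variance formula.

```python
from math import sqrt

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n                          # sample mean
squared_devs = [(x - mean) ** 2 for x in heights]
sum_sq = sum(squared_devs)                       # sum of squared deviations
variance = sum_sq / (n - 1)                      # divide by n - 1 (degrees of freedom)
sd = sqrt(variance)                              # sample standard deviation

print(round(mean, 1))      # 63.4
print(round(sum_sq, 1))    # 85.2
print(round(variance, 2))  # 6.55
print(round(sd, 2))        # 2.56
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.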


\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

1. First calculate the variance \( s^2 \).
2. Then take the square root to get the standard deviation s.

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.

Population Standard Deviation

Denoted by the lowercase Greek letter \( \sigma \)

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation} \]

N is the population size (for example, N = 34,000 for NCSU)

\( \mu \) is the population mean

Value of \( \sigma \) typically not known; use s to estimate value of \( \sigma \)

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and \( \sigma \) are always greater than or equal to zero.

3. The larger the value of s (or \( \sigma \)), the greater the spread of the data.

When does s = 0? When does \( \sigma \) = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: \( s^2 \) is the sample variance; \( \sigma^2 \) is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and \( \sigma \)

s and \( \sigma \) are always greater than or equal to 0. When does s = 0? \( \sigma \) = 0?

The larger the value of s (or \( \sigma \)), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
\( \bar{y} \)   sample mean
m   sample median
\( s^2 \)   sample variance
s   sample stand. dev.

POPULATION
\( \mu \)   population mean
population median
\( \sigma^2 \)   population variance
\( \sigma \)   population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical) = 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in \( (\bar{y} - s,\ \bar{y} + s) \)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in \( (\bar{y} - 2s,\ \bar{y} + 2s) \)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in \( (\bar{y} - 3s,\ \bar{y} + 3s) \)

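68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between \( \bar{y} - s \) and \( \bar{y} + s \), 34% on each side of \( \bar{y} \).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between \( \bar{y} - 2s \) and \( \bar{y} + 2s \), 47.5% on each side of \( \bar{y} \).]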

Example: textbook costs

\( \bar{y} = 375.48 \), s = 42.72, n = 50

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
\( (\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20) \)
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean:
\( (\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92) \)
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean:
\( (\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64) \)
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
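The three interval counts for the textbook costs can be checked with a short script. This is a sketch using Python's standard `statistics` module; the helper name `count_within` is hypothetical and introduced only for this illustration.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328,
         340, 342, 346, 347, 348, 348, 349, 354,
         355, 355, 360, 361, 364, 367, 369, 371,
         373, 377, 380, 381, 382, 385, 385, 387,
         390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437,
         440, 480]

y_bar = mean(costs)   # about 375.48
s = stdev(costs)      # sample standard deviation, about 42.72


def count_within(data, center, spread, k):
    """Count values strictly inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(low < value < high for value in data)


for k in (1, 2, 3):
    count = count_within(costs, y_bar, s, k)
    print(k, count, f"{count / len(costs):.0%}")
# 1 32 64%
# 2 48 96%
# 3 50 100%
```

The observed percentages (64%, 96%, 100%) are close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.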

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

\[ z = \frac{y - \bar{y}}{s} \]

where
y = original data value
\( \bar{y} \) = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

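Exam 1: \( \bar{y}_1 = 88 \), \( s_1 = 6 \); your exam 1 score: 91

Exam 2: \( \bar{y}_2 = 88 \), \( s_2 = 10 \); your exam 2 score: 92

Which score is better?

\[ z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5 \]

\[ z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4 \]

91 on exam 1 is better than 92 on exam 2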

If data has mean \( \bar{y} \) and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean \( \bar{y} \).

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680-500)/100 = 1.8

Gerald's z-score: z = (27-18)/6 = 1.5

Eleanor's score is better.
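The z-score comparisons above reduce to one small function. The sketch below is illustrative only; the function name `z_score` and its parameter names are choices made here, not taken from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (y - mean) / sd


# Exam example: same mean, different spreads
print(z_score(91, 88, 6))     # 0.5
print(z_score(92, 88, 10))    # 0.4

# SAT vs ACT example: different scales made comparable
print(z_score(680, 500, 100))  # 1.8 (Eleanor)
print(z_score(27, 18, 6))      # 1.5 (Gerald)
```

Because a z-score is unitless, it allows scores measured on different scales to be compared directly.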

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland       15.5        6.4        1.79
UVA            13.1        4.0        1.12
Louisville     10.9        1.8        0.50
UNC             9.2        0.1        0.03
VaTech          7.9       -1.2       -0.34
FSU             7.9       -1.2       -0.34
GaTech          7.1       -2.0       -0.56
NCSU            6.5       -2.6       -0.73
Clemson         3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6,185 with a standard deviation of $1,804. In NC the mean tuition was $4,320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

  i   position   value
  1      1        0.6
  2      2        1.2
  3      3        1.6
  4      4        1.9
  5      5        1.5
  6      6        2.1
  7      7        2.3
  8      6        2.3
  9      5        2.5
 10      4        2.8
 11      3        2.9
 12      2        3.3
 13      1        3.4
 14      2        3.6
 15      3        3.7
 16      4        3.8
 17      5        3.9
 18      6        4.1
 19      7        4.2
 20      6        4.5
 21      5        4.7
 22      4        4.9
 23      3        5.3
 24      2        5.6
 25      1        6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: the sorted data split at Q1, M, and Q3 into four pieces, each containing 1/4 of the data.]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
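The quartile rules above can be written as a short function. This is a minimal sketch of the slides' rule (overall median included in both halves when n is odd); the function names are hypothetical. Note that statistical software often uses other quartile conventions, so library results may differ slightly from this rule.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least 2 data values")
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: include the overall median
        upper = ordered[half:]       # in both halves
    else:
        lower = ordered[:half]       # n even: the median is not a data
        upper = ordered[half:]       # value, so the halves are disjoint
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```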

Pulse Rates, n = 138

Count   Stem   Leaves
         4
  3      4     588
  9      5     001233444
 10      5     5556788899
 23      6     00011111122233333344444
 23      6     55556666667777788888888
 16      7     0000011222233444
 23      7     55555666666777888888999
 10      8     0000112224
 10      8     5555667789
  4      9     0012
  2      9     58
  4     10     0223
        10
  1     11     1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem | Leaf
  2      22  | 55
  4      23  | 57
  6      24  | 26
  7      25  | 7
 10      26  | 257
 12      27  | 59
 (4)     28  | 1567
 15      29  | 35599
 10      30  | 333
  7      31  | 45
  5      32  | 155
  2      33  | 6
  1      34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem | Leaf
  2      22  | 55
  4      23  | 57
  6      24  | 26
  7      25  | 7
 10      26  | 257
 12      27  | 59
 (4)     28  | 1567
 15      29  | 35599
 10      30  | 333
  7      31  | 45
  5      32  | 155
  2      33  | 6
  1      34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

  i   position   value
 25      1        6.1
 24      2        5.6
 23      3        5.3
 22      4        4.9
 21      5        4.7
 20      6        4.5
 19      7        4.2
 18      6        4.1
 17      5        3.9
 16      4        3.8
 15      3        3.7
 14      2        3.6
 13      1        3.4
 12      2        3.3
 11      3        2.9
 10      4        2.8
  9      5        2.5
  8      6        2.3
  7      7        2.3
  6      6        2.1
  5      5        1.5
  4      4        1.9
  3      3        1.6
  2      2        1.2
  1      1        0.6

[Boxplot: Disease X. Vertical axis: Years until death (0 to 7).]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

  i   position   value
 25      1        7.9
 24      2        5.6
 23      3        5.3
 22      4        4.9
 21      5        4.7
 20      6        4.5
 19      7        4.2
 18      6        4.1
 17      5        3.9
 16      4        3.8
 15      3        3.7
 14      2        3.6
 13      1        3.4
 12      2        3.3
 11      3        2.9
 10      4        2.8
  9      5        2.5
  8      6        2.3
  7      7        2.3
  6      6        2.1
  5      5        1.5
  4      4        1.9
  3      3        1.6
  2      2        1.2
  1      1        0.6

[Boxplot: Disease X. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
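The 1.5 × IQR outlier check can be sketched as follows. The function name `fences` is hypothetical, and Q1 and Q3 are taken directly from the slide rather than recomputed. Results are rounded on output because floating-point subtraction of decimals such as 4.2 - 2.3 is not exact.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper outlier cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

lower, upper = fences(2.3, 4.2)          # quartiles from the slide
outliers = [v for v in years if v < lower or v > upper]
whisker_top = max(v for v in years if v <= upper)  # end of the upper line

print(round(upper, 2))   # 7.05
print(outliers)          # [7.9]
print(whisker_top)       # 5.6
```

The upper line of the boxplot therefore ends at 5.6, and 7.9 is plotted as a separate point.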

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot with labeled values: 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers. Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew   First   Second   Third   Total
Alive      212    202     118      178     710
Dead       673    123     167      528    1491
Total      885    325     285      706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
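The marginal distributions are row and column totals divided by the grand total. The sketch below recomputes them from the Titanic counts; the dictionary layout and variable names are choices made for this illustration.

```python
# Titanic counts from the contingency table: counts[survival][class]
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())  # 2201

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
classes = list(counts["Alive"])
class_marginal = {
    c: sum(counts[status][c] for status in counts) / grand_total for c in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
# Alive: 32.3%
# Dead: 67.7%
# Crew: 40.2%
# First: 14.8%
# Second: 12.9%
# Third: 32.1%
```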

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew    First   Second   Third   Total
Alive    Count         212     202     118      178     710
         % of col     24.0%   62.2%   41.4%    25.2%   32.3%
Dead     Count         673     123     167      528    1491
         % of col     76.0%   37.8%   58.6%    74.8%   67.7%
Total    Count         885     325     285      706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic counts: 202 first class survivors, 710 survivors in all, 325 first class passengers, 2201 people in all):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
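The three questions differ only in the denominator. A short sketch, with variable names chosen here for illustration, makes the distinction explicit and also recomputes the "% of col" conditional distribution of survival given class.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())               # 710
grand_total = total_alive + sum(dead.values())  # 2201
first_total = alive["First"] + dead["First"]    # 325

# Same numerator (202), three different denominators:
frac_of_survivors_in_first = alive["First"] / total_alive   # 202/710
frac_first_and_survived = alive["First"] / grand_total      # 202/2201
frac_of_first_who_survived = alive["First"] / first_total   # 202/325

print(f"{frac_of_survivors_in_first:.1%}")   # 28.5%
print(f"{frac_first_and_survived:.1%}")      # 9.2%
print(f"{frac_of_first_who_survived:.1%}")   # 62.2%

# Conditional distribution of survival given class ("% of col", Alive row)
survival_given_class = {c: alive[c] / (alive[c] + dead[c]) for c in alive}
print({c: f"{p:.1%}" for c, p in survival_given_class.items()})
# {'Crew': '24.0%', 'First': '62.2%', 'Second': '41.4%', 'Third': '25.2%'}
```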

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

\( (x_i, y_i) \): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]

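Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5. Vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]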
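The formula for r is an average (with divisor n - 1) of products of z-scores. The sketch below applies it to the car weight and fuel consumption data using Python's standard `statistics` module; the function name `correlation` is chosen here for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = 1/(n-1) * sum of (z-score of x_i) * (z-score of y_i)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)     # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))   # 0.9766
```

A value this close to +1 indicates a strong positive linear relationship: heavier cars tend to use more fuel.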

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot. Horizontal axis: Fouls (0 to 30). Vertical axis: Points (0 to 80).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3


Student Pulse Rates (n=62)

38 59 60 60 62 62 63 63 64 64 65 67 68 70 70 70 70 70 70 70 71 71 72 72 73 74 74 75 75 75 75 76 77 77 77 77 78 78 79 79 80 80 80 84 84 85 85 87 90 90 91 92 93 94 94 95 96 96 96 98 98 103

Median = (75+76)2 = 755

The median splits the histogram into 2 halves of equal area

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples Example n = 7

175 28 32 139 141 253 458 Example n = 7 (ordered) 28 32 139 141 175 253 458 Example n = 8

175 28 32 139 141 253 357 458

Example n =8 (ordered)

28 32 139 141 175 253 357 458

m = 141

m = (141+175)2 = 158

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496049604971524555467586

1 5245

2 49655

3 4960

4 4971

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496052455546497155877586

1 5245

2 49655

3 5546

4 4971

Properties of Mean Median1The mean and median are unique that is a

data set has only 1 mean and 1 median (the mean and median are not necessarily equal)

2The mean uses the value of every number in the data set the median does not

14

20 4 6Ex 2 4 6 8 5 5

4 2

21 4 6Ex 2 4 6 9 5 5

4 2

x m

x m

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women's height (inches)

  i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
  1   59    63.4      -4.4            19.0
  2   60    63.4      -3.4            11.3
  3   61    63.4      -2.4             5.6
  4   62    63.4      -1.4             1.8
  5   62    63.4      -1.4             1.8
  6   63    63.4      -0.4             0.1
  7   63    63.4      -0.4             0.1
  8   63    63.4      -0.4             0.1
  9   64    63.4       0.6             0.4
 10   64    63.4       0.6             0.4
 11   65    63.4       1.6             2.7
 12   66    63.4       2.6             7.0
 13   67    63.4       3.6            13.3
 14   68    63.4       4.6            21.6

Mean = 63.4; sum of deviations = 0.0; sum of squared deviations = 85.2

1. First calculate the variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation: $s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
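As one possible way to automate the calculation, here is a minimal Python sketch that repeats the women's-height example step by step and then compares the result with the standard library's `statistics.stdev`. The variable names are illustrative choices, not something the slides prescribe.

```python
import math
import statistics

# Heights (inches) of the 14 women in the example.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n
# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in heights)
variance = sum_sq_dev / (n - 1)   # divide by the degrees of freedom, n - 1
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_sq_dev:.1f}")
print(f"variance = {variance:.2f} square inches")
print(f"standard deviation = {std_dev:.2f} inches")

# The library routine uses the same n - 1 divisor.
print(f"statistics.stdev = {statistics.stdev(heights):.2f}")
```

Note that `statistics.stdev` divides by n - 1 (sample standard deviation), while `statistics.pstdev` divides by N (population standard deviation).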

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU); $\mu$ is the population mean.

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ = population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ = sample variance; $\sigma^2$ = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0 (when does $s = 0$? $\sigma = 0$?).

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION
$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$;

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$;

3) almost all (about 99.7%) of the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$.

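68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]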

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
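The interval counts in the textbook-cost example can be reproduced with a short script. The sketch below is illustrative only; the helper name `count_within` is an assumption of this example, and the mean and standard deviation are recomputed from the data rather than typed in.

```python
import statistics

# Textbook costs for the 50 students in the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = statistics.mean(costs)
s = statistics.stdev(costs)   # sample standard deviation (n - 1 divisor)


def count_within(data, center, spread, k):
    """Count the values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for value in data if low < value < high)


for k, rule in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    count = count_within(costs, y_bar, s, k)
    pct = 100 * count / len(costs)
    print(f"within {k} s.d.: {count}/{len(costs)} = {pct:.0f}% (rule: {rule})")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 figures, which is what "approximately" in the rule allows for.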

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
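The exam and SAT/ACT comparisons both come down to the same one-line formula. A minimal Python sketch follows; the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Exam comparison: same mean, different spreads.
exam1 = z_score(91, mean=88, std_dev=6)
exam2 = z_score(92, mean=88, std_dev=10)

# SAT Math vs ACT Math: different scales made comparable.
eleanor = z_score(680, mean=500, std_dev=100)
gerald = z_score(27, mean=18, std_dev=6)

print(f"exam 1 z = {exam1:.1f}, exam 2 z = {exam2:.1f}")
print(f"Eleanor z = {eleanor:.1f}, Gerald z = {gerald:.1f}")
```

Because z-scores are unit-free, the larger z-score identifies the better relative performance even when the raw scores are on different scales.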

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (25 values, in the order listed on the slide): 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, and Q3 cut the sorted data into 4 pieces, each holding 1/4 of the data.

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
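The quartile rule above can be written out as a short Python sketch. The function names `median_of_sorted` and `quartiles` are illustrative. Be aware that other software may use a different quartile convention, so results from Excel, a calculator, or Python's `statistics.quantiles` can differ slightly from this rule.

```python
def median_of_sorted(values):
    """Median of a list that is already sorted."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the rule stated above:
    n odd  -> the overall median is included in both halves;
    n even -> the overall median is in neither half."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        lower, upper = values[:half + 1], values[half:]
    else:
        lower, upper = values[:half], values[half:]
    return median_of_sorted(lower), median_of_sorted(values), median_of_sorted(upper)


example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print("Q1 =", q1, "median =", m, "Q3 =", q3)
```

For the n = 10 example this reproduces Q1 = 6, m = 11, and Q3 = 16.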

Pulse Rates, n = 138

Count  Stem | Leaves
        4   |
   3    4   | 588
   9    5   | 001233444
  10    5   | 5556788899
  23    6   | 00011111122233333344444
  23    6   | 55556666667777788888888
  16    7   | 0000011222233444
  23    7   | 55555666666777888888999
  10    8   | 0000112224
  10    8   | 5555667789
   4    9   | 0012
   2    9   | 58
   4   10   | 0223
       10   |
   1   11   | 1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Stem-and-leaf display (depth, stem | leaves):
   2   22 | 55
   4   23 | 57
   6   24 | 26
   7   25 | 7
  10   26 | 257
  12   27 | 59
  (4)  28 | 1567
  15   29 | 35599
  10   30 | 333
   7   31 | 45
   5   32 | 155
   2   33 | 6
   1   34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Stem-and-leaf display (depth, stem | leaves):
   2   22 | 55
   4   23 | 57
   6   24 | 26
   7   25 | 7
  10   26 | 257
  12   27 | 59
  (4)  28 | 1567
  15   29 | 35599
  10   30 | 333
   7   31 | 45
   5   32 | 155
   2   33 | 6
   1   34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4; Q3 = third quartile = 4.2; Q1 = first quartile = 2.3; Largest = max = 6.1; Smallest = min = 0.6

Data (25 values, in the order listed on the slide, from the top down): 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: BOXPLOT for Disease X, a display of the five-number summary (min, Q1, m, Q3, max); vertical axis: Years until death, 0 to 7]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (25 values, in the order listed on the slide, from the top down): 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: BOXPLOT (boxplot display of 5-number summary) for Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: IQR = Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
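The 1.5 × IQR outlier check for the Disease X data can be sketched in Python as follows. The quartiles are taken from the slide rather than recomputed, and the variable names (`upper_fence`, `whisker_top`, and so on) are illustrative.

```python
# Years until death for the 25 individuals (largest value is 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2            # quartiles as given on the slide
iqr = q3 - q1                # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The whiskers stop at the most extreme values still inside the fences.
inside = [y for y in years if lower_fence <= y <= upper_fence]
whisker_bottom, whisker_top = min(inside), max(inside)

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print("outliers:", outliers)
print("whiskers run from", whisker_bottom, "to", whisker_top)
```

Here the upper whisker ends at 6.1, the largest value below the 7.05 fence, and 7.9 is plotted separately as an outlier.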

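ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, labeled at 45, 63, 70, and 78, with the fences 40.5 and 100.5 marked]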

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis from 0 to 1369 yards]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival               Crew   First   Second   Third   Total
Alive   Count           212     202      118     178     710
        % of col       24.0    62.2     41.4    25.2    32.3
Dead    Count           673     123      167     528    1491
        % of col       76.0    37.8     58.6    74.8    67.7
Total   Count           885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325
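The marginal and conditional percentages for the Titanic table can be recomputed from the raw counts. The following Python sketch is illustrative only; the dictionary layout and the helper name `pct` are assumptions of this example.

```python
# Counts from the Titanic contingency table: counts[survival][class].
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())
row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Marginal distributions: each total divided by the grand total.
marginal_survival = {s: pct(t, grand_total) for s, t in row_totals.items()}
marginal_class = {c: pct(t, grand_total) for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: pct(counts["Alive"][c], col_totals[c]) for c in classes}

print("marginal survival:", marginal_survival)
print("marginal class:", marginal_class)
print("% alive given class:", survival_given_class)
```

The key difference is the denominator: marginal percentages divide by the grand total (2201), while conditional percentages divide by the total of the group being conditioned on (for example, 325 for first class).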

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: Scatterplot: Blood Alcohol Content vs Number of Beers, plotting the (Beers, BAC) pair of each of the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

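Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$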
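The value r = .9766 can be reproduced directly from the formula. The sketch below is a minimal Python illustration using the ten (weight, fuel consumption) pairs from the fuel consumption example; the function name `correlation` is an illustrative choice.

```python
import statistics


def correlation(xs, ys):
    """r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample s.d.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r does not depend on the units of either variable.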

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


The median splits the histogram into 2 halves of equal area

Mean: balance point. Median: 50% of the area in each half.

mean 55.26 years, median 57.7 years

Medians are used often

Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45; NFL 43; NBA 41; NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158
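The two cases above (odd n and even n) can be captured in a few lines of Python. This is a minimal sketch with an illustrative function name; it uses the digits exactly as they appear in the examples.

```python
def median(data):
    """Middle value of the sorted data (n odd) or the mean of the two middle values (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = [175, 28, 32, 139, 141, 253, 458]          # n = 7
even_example = [175, 28, 32, 139, 141, 253, 357, 458]    # n = 8

print("n = 7 median:", median(odd_example))
print("n = 8 median:", median(even_example))
```

Sorting first is essential: the middle of the unsorted list is generally not the median.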

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location = 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2014
n = 848
mean = $3,932,912
median = $1,456,250
max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985-2014

[Figure: line chart, Baseball Salaries: Mean, Median and Maximum, 1985-2014. Series: Mean, Median, Maximum; x-axis: Year (1985 to 2013); left axis: Mean/Median Salary (200,000 to 3,700,000); right axis: Maximum Salary (0 to 35,000,000)]

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
y = original data value
$\bar{y}$ = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$, your exam 1 score: 91
Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$, your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$
$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100
ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8
Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
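A small Python sketch of the same comparisons; the function name `z_score` is chosen here for illustration and is not from the slides.

```python
def z_score(y, mean, sd):
    """Distance of y from the mean, measured in standard deviations."""
    return (y - mean) / sd


# Exam comparison: same mean, different spreads.
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

# SAT vs ACT comparison: different scales made comparable by standardizing.
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

print("exam 1:", exam1, "exam 2:", exam2)
print("Eleanor:", eleanor, "Gerald:", gerald)
```

The larger z-score marks the better relative performance in each pair.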

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697
Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4
Measures of Position (also called Measures of Relative Standing)

Quartiles
5-Number Summary
Interquartile Range: Another Measure of Spread
Boxplots

m = median = 3.4
Q1 = first quartile = 2.3
Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

        Q1      M      Q3
  1/4   |  1/4  |  1/4  |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).
Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11
Q1: median of lower half 2 4 6 8 10; Q1 = 6
Q3: median of upper half 12 14 16 18 20; Q3 = 16
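The quartile rule above can be sketched in Python as follows. The function names are illustrative, and note that statistical software often uses slightly different quartile conventions, so library results may not match this rule exactly.

```python
def median_of_sorted(values):
    """Median of a list that is already sorted."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the rule from the slide:
    when n is odd the overall median goes into both halves,
    when n is even it goes into neither."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        lower, upper = values[: half + 1], values[half:]
    else:
        lower, upper = values[:half], values[half:]
    return median_of_sorted(lower), median_of_sorted(values), median_of_sorted(upper)


example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```

For an odd-sized data set such as 1, 2, 3, 4, 5, the lower half is 1, 2, 3 and the upper half is 3, 4, 5, so the function returns Q1 = 2 and Q3 = 4.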

Pulse Rates, n = 138

       Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   0000011222233444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

(The first column is the number of leaves on each row.)

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1
measures spread of middle 50% of the data

Example: beginning pulse rates
Q3 = 78, Q1 = 63
IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
  (4)   28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

Disease X: years until death

m = median = 3.4
Q3 = third quartile = 4.2
Q1 = first quartile = 2.3
Largest = max = 6.1
Smallest = min = 0.6

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: boxplot of Disease X; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

BOXPLOT

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2
Q1 = first quartile = 2.3
Largest = max = 7.9

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Boxplot: display of 5-number summary

[Figure: boxplot of Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9
1.5 IQR = 1.5 × 1.9 = 2.85
Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
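The outlier check can be written out as a short Python sketch. Q1 and Q3 are taken from the slide rather than recomputed, and the variable names are illustrative.

```python
# Years until death for the 25 individuals (largest value changed to 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

Q1, Q3 = 2.3, 4.2  # quartiles reported on the slide

iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr

# Values beyond either fence are flagged as outliers.
outliers = [v for v in years if v < lower_fence or v > upper_fence]

# The upper whisker stops at the largest value below the upper fence.
whisker_top = max(v for v in years if v < upper_fence)

print("IQR =", round(iqr, 2))
print("upper fence =", round(upper_fence, 2))
print("outliers:", outliers)
print("upper whisker ends at", whisker_top)
```

The lower fence (Q1 - 1.5 IQR) is negative here, so no small value can be flagged and the lower whisker runs to the minimum.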

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5
Q1 - 1.5(IQR): 63 - 22.5 = 40.5
Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates marked at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data
Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
710/2201 = 32.3%
1491/2201 = 67.7%

marg. dist. of class:
885/2201 = 40.2%
325/2201 = 14.8%
285/2201 = 12.9%
706/2201 = 32.1%
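The marginal distributions can be reproduced from the cell counts with a short Python sketch. The nested-dictionary layout and variable names are an illustrative choice, not something prescribed by the slides.

```python
# Cell counts from the Titanic table: counts[survival][class].
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row.values()) / total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / total
    for cls in counts["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{100 * share:5.1f}%")
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.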

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                               Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col    24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col    76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

[Figure: conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic survival-by-class table):
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325
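The three fractions differ only in their denominators, which a short Python sketch makes explicit. Variable names are illustrative; the counts are those in the Titanic table.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total = sum(alive.values()) + sum(dead.values())

# Conditional distribution of survival given class ("% of col" row for Alive).
survival_given_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

# Same numerator (202 first-class survivors), three different denominators.
first_among_survivors = alive["First"] / sum(alive.values())            # 202/710
first_and_survived = alive["First"] / total                             # 202/2201
survived_given_first = alive["First"] / (alive["First"] + dead["First"])  # 202/325

for cls, share in survival_given_class.items():
    print(f"P(alive | {cls}) = {100 * share:.1f}%")
print(f"{first_among_survivors:.3f} {first_and_survived:.3f} {survived_given_first:.3f}")
```

Reading the question carefully for the group being conditioned on (survivors, all passengers, or first class) determines which total goes in the denominator.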

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:
1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: scatterplot, Blood Alcohol Content vs Number of Beers, shown beside the Student / Beers / BAC table for the 16 students]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption
$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

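Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$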
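The formula for r can be applied directly to the ten (weight, fuel consumption) pairs. This is a sketch using Python's standard `statistics` module; the function name `correlation` is illustrative.

```python
from statistics import mean, stdev

# (x, y) = (car weight in 1000 lbs, fuel consumption in gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight or fuel consumption is recorded on a different scale.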

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player
y = points scored by same player
(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot; x-axis: Fouls, 0 to 30; y-axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Mean balance pointMedian 50 area each half

mean 5526 years median 577years

Medians are used often

Year 2011 baseball salaries

Median $1450000 (max=$32000000 Alex Rodriguez min=$414000)

Median fan age MLB 45 NFL 43 NBA 41 NHL 39

Median existing home sales price May 2011 $166500 May 2010 $174600

Median household income (2008 dollars) 2009 $50221 2008 $52029

Examples

Example: n = 7
175 28 32 139 141 253 458
Example: n = 7 (ordered)
28 32 139 141 175 253 458
m = 141

Example: n = 8
175 28 32 139 141 253 357 458
Example: n = 8 (ordered)
28 32 139 141 175 253 357 458
m = (141 + 175)/2 = 158
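The two cases (odd n and even n) can be sketched in Python as below. The function is written out by hand for illustration; Python's standard `statistics.median` applies the same rule.

```python
def median(data):
    """Middle value of the sorted data; for even n, the mean of the two middle values."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


seven = [175, 28, 32, 139, 141, 253, 458]
eight = [175, 28, 32, 139, 141, 253, 357, 458]

print("n = 7:", median(seven))
print("n = 8:", median(eight))
```

Sorting comes first in both cases; the median is defined by position in the ordered data, not by position in the order the values were recorded.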

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex: 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex: 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location is the 12th obs.; m = 85

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart, Baseball Salaries: Mean, Median and Maximum, 1985-2014; x-axis: Year; left axis: Mean, Median Salary (200000 to 3700000); right axis: Maximum Salary (0 to 35000000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries; x-axis: Salary ($1000s); y-axis: Frequency, 0 to 600; bar labels as extracted: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis: Exam Scores, 20 to 100; y-axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency, 0 to 20]

Section 3.3
Describing Variability of Data

Standard Deviation
Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$
   $y_i - \bar{y}$: deviation of $y_i$ from the mean
   $\sum_{i=1}^{n}(y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$
   $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ always; tells us nothing

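Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$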

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

 i   x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1   59    63.4      -4.4            19.0
 2   60    63.4      -3.4            11.3
 3   61    63.4      -2.4             5.6
 4   62    63.4      -1.4             1.8
 5   62    63.4      -1.4             1.8
 6   63    63.4      -0.4             0.1
 7   63    63.4      -0.4             0.1
 8   63    63.4      -0.4             0.1
 9   64    63.4       0.6             0.4
10   64    63.4       0.6             0.4
11   65    63.4       1.6             2.7
12   66    63.4       2.6             7.0
13   67    63.4       3.6            13.3
14   68    63.4       4.6            21.6
     Mean 63.4    Sum 0.0         Sum 85.2

Mean = 63.4
Sum of squared deviations from mean = 85.2
(n - 1) = 13; (n - 1) is called degrees of freedom (df)
$s^2$ = variance = 85.2/13 = 6.55 square inches
s = standard deviation = $\sqrt{6.55}$ = 2.56 inches

1. First calculate the variance $s^2$:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2. Then take the square root to get the standard deviation s:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

[Figure: the heights with the interval mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
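As one way of getting these numbers from software, here is a minimal Python sketch that follows the two steps (variance first, then its square root) for the 14 heights. Variable names are illustrative.

```python
from math import sqrt

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean_height = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1).
squared_devs = [(x - mean_height) ** 2 for x in heights]
variance = sum(squared_devs) / (n - 1)

# Step 2: standard deviation = square root of the variance.
sd = sqrt(variance)

print(f"mean = {mean_height:.1f}")
print(f"sum of squared deviations = {sum(squared_devs):.1f}")
print(f"variance = {variance:.2f} square inches")
print(f"standard deviation = {sd:.2f} inches")
```

Python's standard library gives the same results with `statistics.variance(heights)` and `statistics.stdev(heights)`, both of which divide by n - 1.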

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

Denoted by the lower case Greek letter $\sigma$
N is the population size (for example, N = 34000 for NCSU)
$\mu$ is the population mean
value of $\sigma$ typically not known; use s to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and $\sigma$ are always greater than or equal to zero.

3. The larger the value of s (or $\sigma$), the greater the spread of the data.

When does s = 0? When does $\sigma$ = 0?
When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of s and $\sigma$

s and $\sigma$ are always greater than or equal to 0. When does s = 0? $\sigma$ = 0?
The larger the value of s (or $\sigma$), the greater the spread of the data.
The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
m   sample median
$s^2$   sample variance
s   sample stand. dev.

POPULATION
$\mu$   population mean
m   population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont.)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)
z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 standard deviation of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y}-s$ and $\bar{y}+s$, 34% on each side of $\bar{y}$]


Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

Medians are used often

Year 2011 baseball salaries

Median: $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)

Median fan age: MLB 45, NFL 43, NBA 41, NHL 39

Median existing home sales price: May 2011 $166,500; May 2010 $174,600

Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141 + 175)/2 = 158

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location: 12th obs.; $m = 85$
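The pulse-rate numbers above can be checked with software. The following is a minimal sketch using only Python's standard `statistics` module; the variable names are illustrative and not from the slides. It also shows, as an aside, how the mean and median react when the unusually large value 140 is dropped.

```python
from statistics import mean, median

# Class pulse rates from the slide (n = 23), already in increasing order.
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

n = len(pulses)
x_bar = mean(pulses)              # sum of the values divided by n
m = median(pulses)                # middle value of the ordered data
median_position = (n + 1) // 2    # for odd n the median sits at position (n + 1) / 2

# Drop the largest value (140) and recompute both summaries.
trimmed = pulses[:-1]
x_bar_trimmed = mean(trimmed)
m_trimmed = median(trimmed)

print(f"n = {n}, mean = {x_bar:.2f}, median = {m} (position {median_position})")
print(f"without 140: mean = {x_bar_trimmed:.2f}, median = {m_trimmed}")
```

The mean moves by about 2.5 beats when one value is removed, while the median moves by half a beat.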

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries. Horizontal axis: Salary ($1000s); vertical axis: Frequency.]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency.]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: Frequency.]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

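Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$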

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$: sample standard deviation

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women height (inches)

  i   $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
  1    59     63.4      -4.4       19.0
  2    60     63.4      -3.4       11.3
  3    61     63.4      -2.4        5.6
  4    62     63.4      -1.4        1.8
  5    62     63.4      -1.4        1.8
  6    63     63.4      -0.4        0.1
  7    63     63.4      -0.4        0.1
  8    63     63.4      -0.4        0.1
  9    64     63.4       0.6        0.4
 10    64     63.4       0.6        0.4
 11    65     63.4       1.6        2.7
 12    66     63.4       2.6        7.0
 13    67     63.4       3.6       13.3
 14    68     63.4       4.6       21.6
 Mean 63.4         Sum   0.0   Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

s = standard deviation = $\sqrt{6.55}$ = 2.56 inches

$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, $s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation s.

[Figure: the heights with the interval Mean ± 1 s.d. marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
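As one way to "get the standard deviation using software", here is a minimal Python sketch that repeats the hand calculation for the 14 heights and then compares it with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The variable names are illustrative only.

```python
from math import sqrt
from statistics import stdev, variance

# Women's heights (inches) from the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                        # sample mean
deviations = [x - x_bar for x in heights]       # x_i - x_bar
sum_sq = sum(d * d for d in deviations)         # sum of squared deviations

s2 = sum_sq / (n - 1)                           # sample variance (df = n - 1)
s = sqrt(s2)                                    # sample standard deviation

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {s2:.2f} square inches, sd = {s:.2f} inches")
print(f"library check: variance = {variance(heights):.2f}, sd = {stdev(heights):.2f}")
```

The deviations themselves sum to zero (up to floating-point rounding), which is why they are squared before averaging.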

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$: population standard deviation

value of $\sigma$ typically not known; use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s² sample variance, σ² population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0; when does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION

$\mu$: population mean
population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
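The textbook-cost comparison can be reproduced in a few lines. This is a sketch using Python's standard library; the helper name `count_within` is hypothetical and introduced only for illustration.

```python
from statistics import mean, stdev

# Textbook costs from the slides (n = 50).
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)   # sample standard deviation (divides by n - 1)

def count_within(k):
    """Count the data values strictly inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    return sum(low < y < high for y in costs)

for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    inside = count_within(k)
    pct = 100 * inside / len(costs)
    print(f"within {k} sd: {inside}/{len(costs)} = {pct:.0f}% (rule says about {rule})")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 values, which is what "approximately" in the rule allows for.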

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
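The two comparisons above (exam 1 vs exam 2, SAT vs ACT) use the same formula, so they can be sketched with one small function. The function name `z_score` is an illustrative choice, not something defined in the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam example: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales made comparable by standardizing.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(f"exam 1: z = {z_exam1:.1f}, exam 2: z = {z_exam2:.1f}")
print(f"Eleanor (SAT): z = {z_eleanor:.1f}, Gerald (ACT): z = {z_gerald:.1f}")
```

In both comparisons the larger z-score marks the better relative performance, even though the raw score may be smaller.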

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar), Sum = 0 (Z-score)
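The "add to zero" property follows from the deviations $y_i - \bar{y}$ summing to zero; dividing each by the same $s$ does not change that. A short Python sketch with the table's support figures (the dictionary layout and names are illustrative assumptions):

```python
from statistics import mean, stdev

# Support figures ($ millions) from the table above.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

y_bar = mean(support.values())
s = stdev(support.values())   # sample standard deviation

# Standardize every school's value: z = (y - y_bar) / s.
z_scores = {school: (y - y_bar) / s for school, y in support.items()}

for school, z in z_scores.items():
    print(f"{school:<11}{support[school]:>6.1f}{z:>8.2f}")
print(f"mean = {y_bar:.4f}, s = {s:.4f}, sum of z-scores = {sum(z_scores.values()):.2f}")
```

Individual z-scores computed this way may differ from the table in the last printed digit because of rounding in the inputs.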

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: the ordered data split at Q1, M, and Q3 into four pieces, each holding 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
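The steps above translate directly into code. The sketch below implements this slide's rule (include the overall median in both halves when n is odd); the function name `quartiles_by_halves` is hypothetical. Note that other software may use different quartile conventions and can return slightly different values for the same data.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles_by_halves(example)
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}")
```

Applied to the 25 Disease X values shown earlier in this section, the same function gives Q1 = 2.3, m = 3.4, and Q3 = 4.2.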

Pulse Rates, n = 138

       Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70: median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Disease X: years until death

Data, individuals 1 to 25: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

[Figure: plot of the Disease X data; vertical axis "Years until death", 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data, individuals 1 to 25: 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 7.9

Largest = max = 7.9

[Figure: BOXPLOT of Disease X; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
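The 1.5 × IQR check above can be sketched as a small helper. The function name `fences` and the variable names are illustrative assumptions; the quartiles are taken from the slide rather than recomputed.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Disease X data with individual 25 equal to 7.9 years.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

low, high = fences(2.3, 4.2)                       # Q1 and Q3 from the slide
outliers = [y for y in years if y < low or y > high]

# The upper whisker stops at the largest value that is not beyond the fence.
upper_whisker_end = max(y for y in years if y <= high)

# Same rule for the beginning-of-class pulse rates (Q1 = 63, Q3 = 78).
pulse_low, pulse_high = fences(63, 78)

print(f"fences: ({low:.2f}, {high:.2f}); outliers: {outliers}")
print(f"upper whisker ends at {upper_whisker_end}")
print(f"pulse fences: ({pulse_low}, {pulse_high})")
```

For the pulse data the fences come out to 40.5 and 100.5, matching the values worked out for the class pulse rates elsewhere in this section.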

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col.   24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col.   76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201
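Marginal and conditional distributions are both ratios of counts from the same table; they differ only in the denominator. The sketch below recomputes them for the Titanic table in plain Python; the nested-dictionary layout and variable names are illustrative assumptions.

```python
# Titanic counts: survival status (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {status: total / grand_total for status, total in row_totals.items()}
marginal_class = {c: total / grand_total for c, total in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

for c in classes:
    print(f"{c:<7} share of people: {100 * marginal_class[c]:.1f}%, "
          f"survived given class: {100 * survival_given_class[c]:.1f}%")
print(f"overall survival: {100 * marginal_survival['Alive']:.1f}%")
```

Dividing the same cell count (for example, 202 first-class survivors) by the row total, the column total, or the grand total answers three different questions.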

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                        Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col.   24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col.   76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Figure: Scatterplot: Blood Alcohol Content vs Number of Beers, for the 16 students (Student, Beers, BAC) listed above]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

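Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$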
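To see the formula in action, the sketch below computes r for the ten (weight, fuel consumption) pairs by standardizing each variable and averaging the products with divisor n - 1. The function name `correlation_r` is illustrative; Python 3.10 and later also provide `statistics.correlation`, which should give the same value.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation_r(xs, ys):
    """r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation_r(weight, fuel)
print(f"r = {r:.4f}")
```

Swapping the roles of x and y leaves r unchanged, since the formula treats the two standardized variables symmetrically.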

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Examples

Example, n = 7: 175 28 32 139 141 253 458

Example, n = 7 (ordered): 28 32 139 141 175 253 458

m = 141

Example, n = 8: 175 28 32 139 141 253 357 458

Example, n = 8 (ordered): 28 32 139 141 175 253 357 458

m = (141+175)/2 = 158
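The two cases above (odd n and even n) can be written as one small function. This is a minimal sketch; the function name `median` is an illustrative choice, and the values are used exactly as they appear on the slide.

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

m7 = median([175, 28, 32, 139, 141, 253, 458])       # n = 7
m8 = median([175, 28, 32, 139, 141, 253, 357, 458])  # n = 8
print(m7, m8)  # 141 158.0
```

Sorting first matters: the median is defined on the ordered data, not on the order in which values were recorded.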

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  4960  4971  5245  5546  7586

1) 5245
2) 4965.5
3) 4960
4) 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  5245  5546  4971  5587  7586

1) 5245
2) 4965.5
3) 5546
4) 4971

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$, $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$, $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

$m$: location is the 12th observation, $m = 85$
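The pulse-rate numbers can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the second half, which drops the largest pulse (140), is a hypothetical comparison that is not on the slide, added only to show how differently the mean and the median react to one extreme value.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89, 90, 90,
          90, 90, 91, 96, 98, 103, 140]

x_bar = mean(pulses)   # uses the value of every observation
m = median(pulses)     # 12th of the 23 ordered values

# Hypothetical comparison (not on the slide): drop the largest pulse, 140.
trimmed = sorted(pulses)[:-1]
x_bar_trimmed = mean(trimmed)
m_trimmed = median(trimmed)

print(round(x_bar, 2), m)
print(round(x_bar_trimmed, 2), m_trimmed)
```

Under this comparison the mean moves by about 2.5 beats while the median moves by half a beat.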

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Figure: "Baseball Salaries: Mean, Median and Maximum, 1985-2014"; x-axis: Year; left axis: Mean, Median Salary (200,000 to 3,700,000); right axis: Maximum Salary (0 to 35,000,000)]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; x-axis: Salary ($1000s); y-axis: Frequency (0 to 600)]

Skewed to the left; negatively skewed

mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; x-axis: Exam Scores (20 to 100); y-axis: Frequency (0 to 30)]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

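Example

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$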

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women height (inches)

 i   xi   xbar   (xi - xbar)   (xi - xbar)^2
 1   59   63.4   -4.4          19.0
 2   60   63.4   -3.4          11.3
 3   61   63.4   -2.4           5.6
 4   62   63.4   -1.4           1.8
 5   62   63.4   -1.4           1.8
 6   63   63.4   -0.4           0.1
 7   63   63.4   -0.4           0.1
 8   63   63.4   -0.4           0.1
 9   64   63.4    0.6           0.4
10   64   63.4    0.6           0.4
11   65   63.4    1.6           2.7
12   66   63.4    2.6           7.0
13   67   63.4    3.6          13.3
14   68   63.4    4.6          21.6

Mean 63.4        Sum 0.0       Sum 85.2


$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
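As one software option, the women's height calculation can be reproduced in Python. This is a minimal sketch using the standard `statistics` module; the slides themselves only mention a calculator, Excel, or other software, so the choice of Python here is an assumption. The manual steps mirror the two-step recipe (variance first, then square root), and `stdev` is used as a cross-check.

```python
from statistics import mean, stdev

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = mean(heights)
ss = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations
s2 = ss / (n - 1)                            # step 1: sample variance
s = s2 ** 0.5                                # step 2: standard deviation

print(round(x_bar, 1), round(ss, 1), round(s2, 2), round(s, 2))
print(round(stdev(heights), 2))  # library value, same n - 1 divisor
```

Note that `statistics.stdev` divides by n - 1 (sample standard deviation), while `statistics.pstdev` divides by n (population standard deviation).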

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$     sample mean
$m$           sample median
$s^2$         sample variance
$s$           sample stand. dev.

POPULATION

$\mu$         population mean
$m$           population median
$\sigma^2$    population variance
$\sigma$      population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) and Histogram (graphical) together give the 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

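68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]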

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
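The three interval counts for the textbook costs can be verified with a short script. This is a sketch under the assumption that the 50 costs listed on the slide are the full data set; the helper name `count_within` is illustrative.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342, 346, 347, 348,
         348, 349, 354, 355, 355, 360, 361, 364, 367, 369, 371, 373, 377,
         380, 381, 382, 385, 385, 387, 390, 390, 397, 398, 409, 409, 410,
         418, 422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = mean(costs)
s = stdev(costs)  # sample standard deviation

def count_within(data, center, spread, k):
    """Number of values strictly inside (center - k*spread, center + k*spread)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo < v < hi for v in data)

counts = {k: count_within(costs, y_bar, s, k) for k in (1, 2, 3)}
for k, c in counts.items():
    print(k, c, f"{100 * c / len(costs):.0f}%")
```

The observed percentages are compared with 68%, 95%, and 99.7%; close agreement is expected only when the histogram is roughly bell-shaped.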

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
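The z-score comparisons can be sketched as a one-line function. The function name `z_score` is an illustrative choice; the means, standard deviations, and scores are those given on the exam and SAT/ACT slides.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - y_bar) / s

# Exam example: same mean, different spreads.
z_exam1 = z_score(91, 88, 6)
z_exam2 = z_score(92, 88, 10)

# SAT vs ACT: different scales made comparable by standardizing.
z_eleanor = z_score(680, 500, 100)
z_gerald = z_score(27, 18, 6)

print(z_exam1, z_exam2)     # 0.5 0.4
print(z_eleanor, z_gerald)  # 1.8 1.5
```

Standardizing removes the units, which is why a 680 on one scale and a 27 on another can be ranked against each other.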

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Individual   Count   Value
 1           1       0.6
 2           2       1.2
 3           3       1.6
 4           4       1.9
 5           5       1.5
 6           6       2.1
 7           7       2.3
 8           6       2.3
 9           5       2.5
10           4       2.8
11           3       2.9
12           2       3.3
13           1       3.4
14           2       3.6
15           3       3.7
16           4       3.8
17           5       3.9
18           6       4.1
19           7       4.2
20           6       4.5
21           5       4.7
22           4       4.9
23           3       5.3
24           2       5.6
25           1       6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
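The quartile rules above can be sketched in code. This is a minimal illustration of the "median of each half" rule exactly as stated on the slide (overall median included in both halves when n is odd); the function names are illustrative. Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different values, so they are deliberately not used here.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: include the overall median in both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the overall median is in neither half
        lower, upper = ordered[:half], ordered[half:]
    return (median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper))

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)  # 6 11.0 16
```

The interquartile range follows directly as `q3 - q1`.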

Pulse Rates, n = 138

Count  Stem  Leaves
        4
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
       10
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
  2    22    55
  4    23    57
  6    24    26
  7    25    7
 10    26    257
 12    27    59
 (4)   28    1567
 15    29    35599
 10    30    333
  7    31    45
  5    32    155
  2    33    6
  1    34    0

1) 287
2) 257.5
3) 263.5
4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
  2    22    55
  4    23    57
  6    24    26
  7    25    7
 10    26    257
 12    27    59
 (4)   28    1567
 15    29    35599
 10    30    333
  7    31    45
  5    32    155
  2    33    6
  1    34    0

1) 23.5
2) 39.5
3) 46
4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Values for individuals 25 down to 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: "Disease X", years until death (axis 0 to 7), marking the five-number summary: min, Q1, m, Q3, max]

Boxplot display of 5-number summary

[Figure: BOXPLOT]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Values for individuals 25 down to 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot display of 5-number summary

[Figure: BOXPLOT of "Disease X", years until death (axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
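The 1.5 IQR check can be sketched as follows. This is an illustration, not a full boxplot routine: the quartiles Q1 = 2.3 and Q3 = 4.2 are taken from the slide rather than recomputed, and the names `fences`, `outliers`, and `whisker_top` are illustrative.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Years until death for the 25 individuals, as listed on this slide.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = fences(2.3, 4.2)  # quartiles taken from the slide
outliers = [y for y in years if y < low or y > high]
whisker_top = max(y for y in years if y <= high)  # where the upper line ends

print(round(high, 2), outliers, whisker_top)
```

The same function gives the cutoffs for the pulse data: `fences(63, 78)` returns 40.5 and 100.5.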

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates marking 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: box plot "Pass Catching Yards by Receivers"; axis from 0 to 1369 with ticks at 136, 273, 410, 547, 684, 821, 958, 1095, 1232]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival

710/2201     32.3%
1491/2201    67.7%

marg. dist. of class

885/2201     40.2%
325/2201     14.8%
285/2201     12.9%
706/2201     32.1%
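The marginal distributions can be recomputed from the cell counts. This is a minimal sketch; the nested-dictionary layout and the variable names are illustrative assumptions, and only the eight interior counts from the Titanic table are entered (row and column totals are derived).

```python
# Interior cell counts from the Titanic table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {cls: sum(counts[status][cls] for status in counts) / total
                  for cls in counts["Alive"]}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.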

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                               Class
Survival               Crew    First   Second   Third   Total
Alive   Count           212      202      118     178     710
        % of col.     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count           673      123      167     528    1491
        % of col.     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count           885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
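The three questions differ only in the denominator, which a short sketch makes explicit. The code below assumes the Titanic counts from the contingency table; the dictionary layout and variable names are illustrative.

```python
# Interior cell counts from the Titanic table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = list(counts["Alive"])

total = sum(sum(row.values()) for row in counts.values())
alive_total = sum(counts["Alive"].values())
class_totals = {c: counts["Alive"][c] + counts["Dead"][c] for c in classes}

# Conditional distribution of survival given class ("% of col." row for Alive).
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

first_alive = counts["Alive"]["First"]
frac_survivors_in_first = first_alive / alive_total             # 202/710
frac_first_and_survivor = first_alive / total                   # 202/2201
frac_first_who_survived = first_alive / class_totals["First"]   # 202/325

for c in classes:
    print(f"{c}: {100 * survival_given_class[c]:.1f}% survived")
```

Same numerator (202) in all three answers; conditioning on survivors, on everyone, or on first class changes which total it is divided by.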

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277


Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  4960  4971  5245  5546  7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429  4960  5245  5546  4971  5587  7586

1. 5245
2. 4965.5
3. 5546
4. 4971
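The two tuition questions can be checked in software. The following is a minimal Python sketch, assuming the run-together digits on the slides are seven four-digit charges each; the variable names are illustrative and not part of the original slides.

```python
from statistics import median

# Tuition charges as read from the two slides (assumed four-digit values).
tuition_a = [4429, 4960, 4960, 4971, 5245, 5546, 7586]
tuition_b = [4429, 4960, 5245, 5546, 4971, 5587, 7586]  # listed unsorted on the slide

# median() sorts a copy internally, so the unsorted list needs no preparation.
median_a = median(tuition_a)
median_b = median(tuition_b)
print(median_a, median_b)
```

The second list shows why the data must be ordered before picking the middle value: the middle entry as listed (5546) is not the median.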

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4+6}{2} = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location 12th obs.; $m = 85$
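The pulse-rate example can be reproduced with a short Python sketch. The last two lines are an illustration added here, not part of the slides: they replace the largest pulse (140) with a hypothetical value of 100 to show that the mean moves while the median does not.

```python
from statistics import mean, median

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
          85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

x_bar = mean(pulses)   # uses every value
m = median(pulses)     # 12th of the 23 ordered values

# Hypothetical change: swap the extreme value 140 for 100.
adjusted = pulses[:-1] + [100]
print(round(x_bar, 2), m)
print(round(mean(adjusted), 2), median(adjusted))
```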

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: "Baseball Salaries: Mean, Median and Maximum 1985-2014". Horizontal axis: Year (1985 to 2013). One vertical axis: Mean, Median Salary ($200,000 to $3,700,000); second vertical axis: Maximum Salary ($0 to $35,000,000). Series: Mean, Median, Maximum.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries". Horizontal axis: Salary ($1000s); vertical axis: Frequency (0 to 600).]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores". Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30).]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers: 10:00-11:00 am". Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
   OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$:
   $y_i - \bar{y}$: deviation of $y_i$ from the mean
   $\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

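Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$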
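The point of the example is that two data sets with very different spread both have deviations that sum to zero. A minimal Python sketch of that check follows; the function name is illustrative, and the range is printed alongside to show that the spread really does differ.

```python
def sum_of_deviations(values):
    """Return the sum of (value - mean) over all values."""
    center = sum(values) / len(values)
    return sum(v - center for v in values)

data_set_1 = [49, 51]
data_set_2 = [0, 100]

for data in (data_set_1, data_set_2):
    # The sum of deviations is 0 for both; the range is not.
    print(sum_of_deviations(data), max(data) - min(data))
```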

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$ is the sample standard deviation

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women height (inches)

 i    x_i   mean   (x_i - mean)   (x_i - mean)^2
 1    59    63.4   -4.4           19.0
 2    60    63.4   -3.4           11.3
 3    61    63.4   -2.4            5.6
 4    62    63.4   -1.4            1.8
 5    62    63.4   -1.4            1.8
 6    63    63.4   -0.4            0.1
 7    63    63.4   -0.4            0.1
 8    63    63.4   -0.4            0.1
 9    64    63.4    0.6            0.4
10    64    63.4    0.6            0.4
11    65    63.4    1.6            2.7
12    66    63.4    2.6            7.0
13    67    63.4    3.6           13.3
14    68    63.4    4.6           21.6
Mean 63.4; sum of deviations 0.0; sum of squared deviations 85.2

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
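As one software option, the women's height calculation can be reproduced with Python's standard `statistics` module. This is a sketch rather than the course's prescribed tool; `variance` and `stdev` both use the n - 1 divisor, matching the sample formulas on the slides.

```python
from statistics import mean, stdev, variance

# The 14 women's heights (inches) from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
s2 = variance(heights)  # sample variance: squared deviations divided by n - 1
s = stdev(heights)      # sample standard deviation: square root of s2
print(round(x_bar, 1), round(s2, 2), round(s, 2))
```

For a whole population rather than a sample, the same module offers `pvariance` and `pstdev`, which divide by N.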

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N = 34{,}000$ for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

value of $\sigma$ typically not known; use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance, $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$: sample mean
$m$: sample median
$s^2$: sample variance
$s$: sample stand. dev.

POPULATION
$\mu$: population mean
$m$: population median
$\sigma^2$: population variance
$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) to the histogram (graphical).]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

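68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$ (34% on each side of $\bar{y}$).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$ (47.5% on each side of $\bar{y}$).]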

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
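The three interval counts for the textbook costs can be checked with a short Python sketch. The helper name `share_within` is illustrative; the mean and standard deviation are recomputed from the 50 listed values rather than typed in.

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

def share_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    center, spread = mean(values), stdev(values)
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)

for k in (1, 2, 3):
    print(k, share_within(costs, k))
```

The observed shares are close to, but not exactly, 68%, 95%, and 99.7%, which is what the rule promises for data that are only approximately bell-shaped.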

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
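Both comparisons (the two exams and the SAT/ACT scores) apply the same formula, so they can be sketched with one small Python function. The function and variable names are illustrative.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam_1 = z_score(91, 88, 6)
exam_2 = z_score(92, 88, 10)
eleanor = z_score(680, 500, 100)  # SAT Math
gerald = z_score(27, 18, 6)       # ACT Math

print(exam_1, exam_2)    # the larger z-score is the better relative result
print(eleanor, gerald)
```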

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
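That z-scores add to zero follows from the deviations adding to zero. A minimal Python sketch using the nine support figures, read as $ millions with one decimal place (an assumption about the extracted digits):

```python
from statistics import mean, stdev

support = {"Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
           "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)

# Standardize every school's support figure.
z = {school: (y - y_bar) / s for school, y in support.items()}
print(round(y_bar, 4), round(s, 4), round(sum(z.values()), 10))
```

Because the listed figures are rounded, an individual z-score computed this way can differ from the slide in the last printed digit.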

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Diagram: Q1, M, and Q3 split the ordered data into four pieces, each holding 1/4 of the data.]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
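The quartile rule stated in this section can be sketched in Python as follows. The function name is illustrative. Note that this is the slides' rule (include the overall median in both halves when n is odd); many software packages use other conventions by default and can return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the halves split cleanly; the median is in neither.
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(quartiles(example))
```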

Pulse Rates, n = 138

      Stem  Leaves
        4
   3    4   588
   9    5   001233444
  10    5   5556788899
  23    6   00011111122233333344444
  23    6   55556666667777788888888
  16    7   0000011222233444
  23    7   55555666666777888888999
  10    8   0000112224
  10    8   5555667789
   4    9   0012
   2    9   58
   4   10   0223
       10
   1   11   1

(The unlabeled first column is the number of leaves on each stem.)

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem  leaf
   2   22   55
   4   23   57
   6   24   26
   7   25   7
  10   26   257
  12   27   59
 (4)   28   1567
  15   29   35599
  10   30   333
   7   31   45
   5   32   155
   2   33   6
   1   34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem  leaf
   2   22   55
   4   23   57
   6   24   26
   7   25   7
  10   26   257
  12   27   59
 (4)   28   1567
  15   29   35599
  10   30   333
   7   31   45
   5   32   155
   2   33   6
   1   34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease data (25 values, largest to smallest):
6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 6.1
Q3 = third quartile = 4.2
m = median = 3.4
Q1 = first quartile = 2.3
Smallest = min = 0.6

[Figure: boxplot for Disease X. Vertical axis: Years until death (0 to 7). Five-number summary: min, Q1, m, Q3, max.]

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary (with an outlier)

Disease data with the largest value changed to 7.9 (largest to smallest):
7.9 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Q3 = third quartile = 4.2
Q1 = first quartile = 2.3
Largest = max = 7.9

[Figure: BOXPLOT for Disease X. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
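The 1.5 × IQR check for outliers can be sketched in Python. The function name `fences` is illustrative, and the quartiles are taken from the slide rather than recomputed. The data values assume the extracted digits carry one decimal place (for example, 79 read as 7.9 years).

```python
def fences(q1, q3):
    """Return the (lower, upper) 1.5 * IQR limits used to flag outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

lower, upper = fences(2.3, 4.2)
outliers = [y for y in years if y < lower or y > upper]
# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper)
print(round(upper, 2), outliers, whisker_top)
```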

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulses marking 40.5, 45, 63, 70, 78, and 100.5.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers". Axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
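The marginal distributions come from the row and column totals of the Titanic table divided by the grand total. A minimal Python sketch, with an illustrative nested-dictionary layout for the counts:

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Row totals / grand total: marginal distribution of survival (percent).
survival_marginal = {status: 100 * sum(row.values()) / total
                     for status, row in counts.items()}

# Column totals / grand total: marginal distribution of class (percent).
class_marginal = {cls: 100 * sum(row[cls] for row in counts.values()) / total
                  for cls in counts["Alive"]}

print({k: round(v, 1) for k, v in survival_marginal.items()})
print({k: round(v, 1) for k, v in class_marginal.items()})
```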

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival            Crew    First   Second   Third   Total
Alive   Count        212      202      118     178     710
        % of col   24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count        673      123      167     528    1491
        % of col   76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count        885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                        Class
Survival            Crew    First   Second   Third   Total
Alive   Count        212      202      118     178     710
        % of col   24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count        673      123      167     528    1491
        % of col   76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count        885      325      285     706    2201
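The three questions use the same count (202 first-class survivors) with three different denominators. A short Python sketch of the distinction follows; the variable names are illustrative.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}
total = sum(alive.values()) + sum(dead.values())

# Conditional distribution of survival given class ("% of col" row for Alive).
pct_survived_by_class = {cls: 100 * alive[cls] / (alive[cls] + dead[cls])
                         for cls in alive}

first_among_survivors = alive["First"] / sum(alive.values())              # 202/710
first_and_survived = alive["First"] / total                               # 202/2201
survived_among_first = alive["First"] / (alive["First"] + dead["First"])  # 202/325

print({k: round(v, 1) for k, v in pct_survived_by_class.items()})
print(round(first_among_survivors, 3), round(first_and_survived, 3),
      round(survived_among_first, 3))
```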

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.10
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

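Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT". Horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$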
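The correlation formula can be applied directly to the ten (weight, fuel consumption) pairs. The Python sketch below assumes the extracted pairs carry one decimal place, e.g. (34 55) read as (3.4, 5.5); the function name is illustrative.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Average product of standardized x and y values, using the n - 1 divisor."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = ((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Because r is built from standardized values, changing the units of either variable (pounds to kilograms, gallons to liters) leaves it unchanged.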

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):
(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot. Horizontal axis: Fouls (0 to 30); vertical axis: Points (0 to 80).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 61: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Below are the annual tuition charges at 7 public universities What is the median

tuition

4429496052455546497155877586

1 5245

2 49655

3 5546

4 4971

Properties of Mean Median1The mean and median are unique that is a

data set has only 1 mean and 1 median (the mean and median are not necessarily equal)

2The mean uses the value of every number in the data set the median does not

14

20 4 6Ex 2 4 6 8 5 5

4 2

21 4 6Ex 2 4 6 9 5 5

4 2

x m

x m

Example class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

23

1

23

844823

location 12th obs 85

ii

n

xx

m m

2010 2014 baseball salaries

2010

n = 845

mean = $3297828

median = $1330000

max = $33000000

2014

n = 848

mean = $3932912

median = $1456250

max = $28000000

>

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x


$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation s.

$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
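As one possible "other software" route, here is a minimal Python sketch (not from the original slides) that reproduces the women's height calculation step by step and then checks it against the standard library's `statistics.stdev`, which also divides by n - 1. The variable names are illustrative.

```python
import math
import statistics

# Women's heights (inches) from the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n
# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in heights)
# Sample variance divides by the degrees of freedom, n - 1.
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

print("mean:", round(mean, 1))
print("sum of squared deviations:", round(sum_sq_dev, 1))
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))

# The library routine should agree with the step-by-step result.
library_std_dev = statistics.stdev(heights)
```

Rounded to the precision used on the slide, the sketch gives the same mean, sum of squared deviations, variance, and standard deviation.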

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$: population standard deviation

N is the population size (for example, N = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known; use s to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and $\sigma$ are always greater than or equal to zero.

3. The larger the value of s (or $\sigma$), the greater the spread of the data.

When does s = 0? When does $\sigma$ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of s and $\sigma$

s and $\sigma$ are always greater than or equal to 0 (when does s = 0? $\sigma$ = 0?)

The larger the value of s (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

(the same 50 textbook costs) $\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

(the same 50 textbook costs) $\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

(the same 50 textbook costs) $\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
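The three interval checks above can be automated. The following is a minimal Python sketch (not from the original slides) that recomputes the mean and sample standard deviation of the 50 textbook costs and counts how many values fall within 1, 2, and 3 standard deviations of the mean. The helper name `count_within` is an illustrative choice.

```python
import statistics

# The 50 textbook costs from the example above.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

mean = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (divides by n - 1)


def count_within(k):
    """Number of costs strictly inside (mean - k*s, mean + k*s)."""
    low, high = mean - k * s, mean + k * s
    return sum(low < c < high for c in costs)


for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    count = count_within(k)
    percent = 100 * count / len(costs)
    print(f"within {k} s.d.: {count}/{len(costs)} = {percent:.0f}% (rule: {rule})")
```

The observed percentages are close to, but not exactly, the 68-95-99.7 values, which is what the rule promises for roughly bell-shaped data.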

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.
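A minimal Python sketch (not from the original slides) of the z-score formula, applied to the two exam scores above. The function name `z_score` is an illustrative choice.

```python
def z_score(y, mean, std_dev):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / std_dev


# Exam 1: mean 88, s = 6, score 91. Exam 2: mean 88, s = 10, score 92.
z_exam_1 = z_score(91, mean=88, std_dev=6)
z_exam_2 = z_score(92, mean=88, std_dev=10)

print("exam 1 z-score:", z_exam_1)
print("exam 2 z-score:", z_exam_2)
print("better relative score:", "exam 1" if z_exam_1 > z_exam_2 else "exam 2")
```

Although 92 is the larger raw score, the 91 sits further above its class mean when measured in standard deviations.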

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5      6.4        1.79
UVA          13.1      4.0        1.12
Louisville   10.9      1.8        0.50
UNC          9.2       0.1        0.03
VaTech       7.9       -1.2       -0.34
FSU          7.9       -1.2       -0.34
GaTech       7.1       -2.0       -0.56
NCSU         6.5       -2.6       -0.73
Clemson      3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

rank   count   value
1      1       0.6
2      2       1.2
3      3       1.6
4      4       1.9
5      5       1.5
6      6       2.1
7      7       2.3
8      6       2.3
9      5       2.5
10     4       2.8
11     3       2.9
12     2       3.3
13     1       3.4
14     2       3.6
15     3       3.7
16     4       3.8
17     5       3.9
18     6       4.1
19     7       4.2
20     6       4.5
21     5       4.7
22     4       4.9
23     3       5.3
24     2       5.6
25     1       6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
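The quartile rules above can be sketched in a few lines of Python (not from the original slides). The function name `quartiles` is an illustrative choice, and the code follows this deck's convention only: when n is odd the overall median is included in both halves. Calculators and software packages may use other conventions and can give slightly different quartiles.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slides."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the halves split cleanly; the median is in neither.
        lower, upper = values[:half], values[half:]
    return (statistics.median(lower),
            statistics.median(values),
            statistics.median(upper))


example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```

For the ten-value example this reproduces Q1 = 6, median = 11, and Q3 = 16.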

Pulse Rates, n = 138

count   Stem   Leaves
        4
3       4      588
9       5      001233444
10      5      5556788899
23      6      00011111122233333344444
23      6      55556666667777788888888
16      7      00000112222334444
23      7      55555666666777888888999
10      8      0000112224
10      8      5555667789
4       9      0012
2       9      58
4       10     0223
        10
1       11     1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

rank   count   value
25     1       6.1
24     2       5.6
23     3       5.3
22     4       4.9
21     5       4.7
20     6       4.5
19     7       4.2
18     6       4.1
17     5       3.9
16     4       3.8
15     3       3.7
14     2       3.6
13     1       3.4
12     2       3.3
11     3       2.9
10     4       2.8
9      5       2.5
8      6       2.3
7      7       2.3
6      6       2.1
5      5       1.5
4      4       1.9
3      3       1.6
2      2       1.2
1      1       0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

rank   count   value
25     1       7.9
24     2       6.1
23     3       5.3
22     4       4.9
21     5       4.7
20     6       4.5
19     7       4.2
18     6       4.1
17     5       3.9
16     4       3.8
15     3       3.7
14     2       3.6
13     1       3.4
12     2       3.3
11     3       2.9
10     4       2.8
9      5       2.5
8      6       2.3
7      7       2.3
6      6       2.1
5      5       1.5
4      4       1.9
3      3       1.6
2      2       1.2
1      1       0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
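A minimal Python sketch (not from the original slides) of the 1.5 IQR outlier check for the Disease X data above. It assumes the deck's median-of-halves quartile rule (n is odd here, so the median is included in both halves); the variable names are illustrative.

```python
import statistics

# Years until death for the 25 individuals (largest value 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

values = sorted(years)
half = len(values) // 2  # n = 25 is odd, so both halves include the median
q1 = statistics.median(values[:half + 1])
q3 = statistics.median(values[half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers.
outliers = [y for y in values if y < lower_fence or y > upper_fence]
# The upper whisker stops at the largest value that is not an outlier.
upper_whisker = max(y for y in values if y <= upper_fence)

print("IQR:", round(iqr, 2))
print("upper fence:", round(upper_fence, 2))
print("outliers:", outliers)
print("upper whisker ends at:", upper_whisker)
```

The only flagged value is 7.9, and the upper whisker ends at the largest value below the fence.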

ATM Withdrawals by Day Month Holidays

Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot markers: 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew   First   Second   Third   Total
Alive   212    202     118      178     710
Dead    673    123     167      528     1491
Total   885    325     285      706     2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
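A minimal Python sketch (not from the original slides) that computes both marginal distributions from the Titanic counts. The nested-dictionary layout and variable names are illustrative choices.

```python
# Titanic counts by survival status and class, from the contingency table.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / grand_total
    for cls in counts["Alive"]
}

for label, dist in [("survival", survival_marginal), ("class", class_marginal)]:
    print(label, {k: f"{100 * v:.1f}%" for k, v in dist.items()})
```

Each marginal distribution uses only the totals in the table's margins, so the percentages in each one add to 100%.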

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                   Class
Survival           Crew    First   Second   Third   Total
Alive  Count       212     202     118      178     710
       % of col.   24.0%   62.2%   41.4%    25.2%   32.3%
Dead   Count       673     123     167      528     1491
       % of col.   76.0%   37.8%   58.6%    74.8%   67.7%
Total  Count       885     325     285      706     2201
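A minimal Python sketch (not from the original slides) of the conditional distribution of survival given class: each count is divided by its own column total rather than by the grand total. The variable names are illustrative.

```python
# Titanic counts by class, from the contingency table.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Condition on class: divide by the column total for that class.
survival_given_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

for cls, share in survival_given_class.items():
    print(f"{cls:<7} alive: {100 * share:.1f}%   dead: {100 * (1 - share):.1f}%")
```

Comparing these column percentages across classes is what the segmented bar chart displays.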

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                   Class
Survival           Crew    First   Second   Third   Total
Alive  Count       212     202     118      178     710
       % of col.   24.0%   62.2%   41.4%    25.2%   32.3%
Dead   Count       673     123     167      528     1491
       % of col.   76.0%   37.8%   58.6%    74.8%   67.7%
Total  Count       885     325     285      706     2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

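Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$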
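A minimal Python sketch (not from the original slides) that evaluates the correlation formula on the ten (weight, fuel consumption) pairs from the scatterplot example. The function name `correlation` is an illustrative choice; it standardizes each coordinate with the sample mean and sample standard deviation and averages the products over n - 1.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and standardized y."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print("r =", round(r, 3))
```

The result agrees with the value of r reported for this data: a strong positive linear relationship.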

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr. olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!


Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: x-axis: Fouls, 0 to 30; y-axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3


Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4 + 6}{2} = 5$

Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4 + 6}{2} = 5$
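A minimal Python sketch (not from the original slides) of property 2, using the standard library's `statistics` module on the two small examples: changing the largest value from 8 to 9 moves the mean but leaves the median where it was.

```python
import statistics

first = [2, 4, 6, 8]
second = [2, 4, 6, 9]  # only the largest value differs

mean_first, median_first = statistics.mean(first), statistics.median(first)
mean_second, median_second = statistics.mean(second), statistics.median(second)

print(first, "mean:", mean_first, "median:", median_first)
print(second, "mean:", mean_second, "median:", median_second)
```

The mean responds to every value in the data set, while the median depends only on the middle of the sorted data.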

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$

m: location is the 12th obs.; m = 85

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000


Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries, 1985 - 2014

[Chart: Baseball Salaries: Mean, Median, and Maximum, 1985-2014; x-axis: Year, 1985 to 2013; left axis: Mean, Median Salary, 200,000 to 3,700,000; right axis: Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Histogram: 2011 Baseball Salaries; x-axis: Salary ($1000s); y-axis: Frequency, 0 to 600; bar frequencies: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Histogram of Exam Scores; x-axis: Exam Scores, 20 to 100; y-axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Histogram: Bank Customers, 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency, 0 to 20]

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1     M     Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:

when n is odd, include the overall median in both halves

when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
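The quartile rules above translate directly into code. The following is a sketch of that specific rule (overall median included in both halves when n is odd, excluded when n is even); statistical software often uses other quartile conventions, so results from a calculator or library may differ slightly. The function names are illustrative.

```python
def median(sorted_vals):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    vals = sorted(data)
    n = len(vals)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    if n % 2 == 1:
        lower = vals[:half + 1]   # n odd: overall median goes in both halves
        upper = vals[half:]
    else:
        lower = vals[:half]       # n even: halves do not share a value
        upper = vals[half:]
    return median(lower), median(vals), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print(q1, m, q3)
```

For the ten-value example this reproduces Q1 = 6, m = 11 and Q3 = 16.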

Pulse Rates, n = 138

Count  Stem  Leaves
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    0000011222233444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem  leaf
  2    22    55
  4    23    57
  6    24    26
  7    25    7
 10    26    257
 12    27    59
 (4)   28    1567
 15    29    35599
 10    30    333
  7    31    45
  5    32    155
  2    33    6
  1    34    0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem  leaf
  2    22    55
  4    23    57
  6    24    26
  7    25    7
 10    26    257
 12    27    59
 (4)   28    1567
 15    29    35599
 10    30    333
  7    31    45
  5    32    155
  2    33    6
  1    34    0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X, years until death (n = 25), listed from individual 25 down to individual 1:

6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

[Figure: plot of the 25 Disease X values with the five-number summary marked; vertical axis "Years until death", 0 to 7]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary

Disease X, years until death (n = 25), listed from individual 25 down to individual 1:

7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 8]
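The 1.5 × IQR outlier check for the Disease X data can be sketched as follows. The values are the 25 survival times read with one decimal place, and the quartiles use the rule from the earlier slide (overall median included in both halves when n is odd). The names `lower_fence`, `upper_fence` and `whisker_top` are illustrative labels, not terms from the slides.

```python
def median(sorted_vals):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_vals)
    mid = n // 2
    return sorted_vals[mid] if n % 2 else (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def q1_q3(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    vals = sorted(data)
    half = len(vals) // 2
    lower = vals[:half + 1] if len(vals) % 2 else vals[:half]
    upper = vals[half:]
    return median(lower), median(upper)

years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = q1_q3(years)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond either fence are flagged as outliers
outliers = [v for v in years if v < lower_fence or v > upper_fence]
# The upper whisker stops at the largest value that is not an outlier
whisker_top = max(v for v in years if v <= upper_fence)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr:.2f}")
print(f"upper fence = {upper_fence:.2f}, outliers = {outliers}, whisker top = {whisker_top}")
```

Only 7.9 lies beyond the upper fence of 7.05, so the upper whisker ends at 6.1.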

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
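The marginal distributions come from the row and column totals divided by the grand total. A minimal sketch, assuming the Titanic counts are stored in a nested dictionary (the layout and variable names are illustrative):

```python
# Counts from the Titanic contingency table: status -> class -> count
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
classes = ["Crew", "First", "Second", "Third"]
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / grand_total
    for cls in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```

Each marginal distribution sums to 100% because every person falls in exactly one row and one column.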

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions

Given the class of a passenger, what is the chance the passenger survived?

                      Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                      Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
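The three questions differ only in the denominator, which the short sketch below makes explicit. It also recomputes the "% of col" rows of the conditional distribution table. Variable names are illustrative.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())                    # 710 survivors
class_totals = {c: alive[c] + dead[c] for c in alive}
grand_total = sum(class_totals.values())             # 2201 people

# Same numerator (202 first-class survivors), three different denominators
first_given_alive = alive["First"] / total_alive             # condition on survival
first_and_alive = alive["First"] / grand_total               # joint share of everyone
alive_given_first = alive["First"] / class_totals["First"]   # condition on class

# Conditional distribution of survival within each class (% of column)
pct_alive_by_class = {c: 100 * alive[c] / class_totals[c] for c in alive}

print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
print({c: round(p, 1) for c, p in pct_alive_by_class.items()})
```

Conditioning on class (202/325) answers "given first class, what is the chance of survival", which is the quantity shown in the % of col rows.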

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

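Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$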
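The formula for r is an average (with divisor n - 1) of products of z-scores. As a sketch, it can be applied to the ten (weight, fuel consumption) pairs from the scatterplot slide, with the values read as thousands of pounds and gallons per 100 miles; the function name `correlation` is illustrative.

```python
import statistics

def correlation(xs, ys):
    """Correlation coefficient r: sum of z_x * z_y products divided by (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because each factor is a z-score, r has no units and does not change if weight is expressed in kilograms or fuel consumption in liters.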

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr. olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will having no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points):

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points vs fouls; horizontal axis: Fouls, 0 to 30; vertical axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

n = 23

$$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$$

m: location: 12th obs.; m = 85
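The mean and median of the class pulse rates can be reproduced with Python's standard `statistics` module. The second half of the sketch drops the largest pulse (140) purely as an illustration, not something done on the slide, to show how differently the two measures react to a single extreme value.

```python
import statistics

pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85, 89, 90,
          90, 90, 90, 91, 96, 98, 103, 140]

n = len(pulses)
mean = statistics.mean(pulses)       # sum of the 23 values divided by 23
median = statistics.median(pulses)   # 12th value of the 23 sorted values

# Illustration only: remove the largest observation and recompute
trimmed = sorted(pulses)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"n = {n}, mean = {mean:.2f}, median = {median}")
print(f"without 140: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

Removing one value moves the mean by about 2.5 beats per minute but the median by only 0.5, which is why the median is described as less sensitive to extreme observations.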

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum 1985-2014"; horizontal axis: Year, 1985 to 2013; left vertical axis: Mean, Median Salary, 200,000 to 3,700,000; right vertical axis: Maximum Salary, 0 to 35,000,000]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; horizontal axis: Salary ($1000s); vertical axis: Frequency, 0 to 600]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; horizontal axis: Exam Scores, 20 to 100; vertical axis: Frequency, 0 to 30]

Symmetric data

mean, median approx. equal

[Figure: histogram "Bank Customers 10:00-11:00 am"; horizontal axis: Number of Customers; vertical axis: Frequency, 0 to 20]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest: OK sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ always; tells us nothing}$$

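Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$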
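A quick numerical check of the two data sets shows why the plain sum of deviations is useless as a measure of spread, while squaring the deviations first separates the two sets clearly. This is a minimal sketch using the standard `statistics` module; the helper name is illustrative.

```python
import statistics

def sum_of_deviations(data):
    """Sum of (value - mean) over the data; always 0 up to rounding."""
    mean = statistics.mean(data)
    return sum(v - mean for v in data)

set1 = [49, 51]
set2 = [0, 100]

dev1 = sum_of_deviations(set1)
dev2 = sum_of_deviations(set2)

# Squaring before summing keeps negative and positive deviations from cancelling
sd1 = statistics.stdev(set1)   # sqrt(((-1)**2 + 1**2) / (2 - 1))
sd2 = statistics.stdev(set2)   # sqrt(((-50)**2 + 50**2) / (2 - 1))

print(dev1, dev2)
print(f"s1 = {sd1:.2f}, s2 = {sd2:.2f}")
```

Both sums of deviations are 0, yet the sample standard deviations are about 1.41 and 70.71, reflecting the very different spreads.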

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{sample standard deviation}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Women height (inches)

  i    x_i    xbar   (x_i - xbar)   (x_i - xbar)^2
  1    59     63.4      -4.4            19.0
  2    60     63.4      -3.4            11.3
  3    61     63.4      -2.4             5.6
  4    62     63.4      -1.4             1.8
  5    62     63.4      -1.4             1.8
  6    63     63.4      -0.4             0.1
  7    63     63.4      -0.4             0.1
  8    63     63.4      -0.4             0.1
  9    64     63.4       0.6             0.4
 10    64     63.4       0.6             0.4
 11    65     63.4       1.6             2.7
 12    66     63.4       2.6             7.0
 13    67     63.4       3.6            13.3
 14    68     63.4       4.6            21.6
       Mean 63.4       Sum 0.0        Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches


$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation s.

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
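As one example of "other software", the sketch below computes the variance and standard deviation of the 14 women's heights in Python, once step by step following the two-step recipe and once with the standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1)
squared_devs = [(x - mean) ** 2 for x in heights]
variance = sum(squared_devs) / (n - 1)

# Step 2: standard deviation = square root of the variance
sd = math.sqrt(variance)

print(f"mean = {mean:.1f}, sum of squared deviations = {sum(squared_devs):.1f}")
print(f"variance = {variance:.2f} square inches, sd = {sd:.2f} inches")

# The library functions use the same n - 1 divisor
assert math.isclose(variance, statistics.variance(heights))
assert math.isclose(sd, statistics.stdev(heights))
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead and correspond to the population formulas, so they should not be used for a sample.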

Population Standard Deviation

Denoted by the lower case Greek letter σ

N is the population size (for example, N = 34,000 for NCSU)

μ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of σ is typically not known; use s to estimate the value of σ.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ sample variance; $\sigma^2$ population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0

when does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data

the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

m: sample median

$s^2$: sample variance

s: sample stand. dev.

POPULATION

μ: population mean

population median

$\sigma^2$: population variance

σ: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical) + Histogram (graphical): 68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

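68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between y-s and y+s, 34% on each side of the mean y; vertical axis 0 to 0.45]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between y-2s and y+2s, 47.5% on each side of the mean y; vertical axis 0 to 0.45]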

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342 346 347 348 348 349 354 355 355 360 361 364 367 369 371 373 377 380 381 382 385 385 387 390 390 397 398 409 409 410 418 422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean:

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
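The three interval counts for the textbook costs can be reproduced with a short script. This sketch takes the mean and standard deviation as stated on the slide (375.48 and 42.72) rather than recomputing them, and the helper name `count_within` is illustrative.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342, 346, 347, 348,
         348, 349, 354, 355, 355, 360, 361, 364, 367, 369, 371, 373, 377,
         380, 381, 382, 385, 385, 387, 390, 390, 397, 398, 409, 409, 410,
         418, 422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

Y_BAR = 375.48   # sample mean as given on the slide
S = 42.72        # sample standard deviation as given on the slide

def count_within(data, center, sd, k):
    """Count the values strictly inside (center - k*sd, center + k*sd)."""
    low, high = center - k * sd, center + k * sd
    return sum(1 for v in data if low < v < high)

rule = {1: 68.0, 2: 95.0, 3: 99.7}   # percentages predicted by the rule
observed = {k: count_within(costs, Y_BAR, S, k) for k in rule}

for k, count in observed.items():
    pct = 100 * count / len(costs)
    print(f"within {k} sd: {count}/{len(costs)} = {pct:.0f}% (rule: {rule[k]}%)")
```

The observed 64%, 96% and 100% are close to the 68%, 95% and 99.7% the rule predicts, which is what one would expect for roughly bell-shaped data.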

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40


x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>
Page 64: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

2010, 2014 baseball salaries

2010

n = 845

mean = $3,297,828

median = $1,330,000

max = $33,000,000

2014

n = 848

mean = $3,932,912

median = $1,456,250

max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
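A minimal Python sketch of this point. The salaries below are made-up illustrative values (in $1000s), not the baseball data, and the variable names are only for this example. One very large value pulls the mean far up while the median barely moves.

```python
from statistics import mean, median

# Hypothetical salaries in $1000s (illustrative values, not from the slides).
salaries = [500, 520, 550, 600, 650, 700, 800]

# Same data plus one very large salary.
with_star = salaries + [30000]

print("without extreme value:", mean(salaries), median(salaries))
print("with extreme value:   ", mean(with_star), median(with_star))
```

The median changes from 600 to 625, while the mean jumps from about 617 to 4290.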

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: line chart "Baseball Salaries: Mean, Median and Maximum 1985-2014"; x-axis: Year (1985 to 2013); one y-axis: Mean, Median Salary (ticks 200,000 to 3,700,000); other y-axis: Maximum Salary (ticks 0 to 35,000,000); series: Mean, Median, Maximum]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram "2011 Baseball Salaries"; x-axis: Salary ($1000s); y-axis: Frequency (0 to 600); bar frequencies: 53, 490, 102, 72, 35, 21, 26, 17, 8, 10, 2, 3, 1, 0, 0, 1]

Skewed to the left; negatively skewed

Mean < median: mean = 78, median = 87

[Figure: "Histogram of Exam Scores"; x-axis: Exam Scores (20 to 100); y-axis: Frequency (0 to 30)]

Symmetric data

mean, median approximately equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

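Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$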
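Why the deviations cancel for every data set, written out as a short derivation. It uses only the definition of the mean, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$:

```latex
\sum_{i=1}^{n} (y_i - \bar{y})
  = \sum_{i=1}^{n} y_i - n\bar{y}
  = n\bar{y} - n\bar{y}
  = 0
```

This is why the deviations are squared before they are combined into a measure of spread.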

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ (sample standard deviation)

$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

  i   x_i   x-bar   (x_i - x-bar)   (x_i - x-bar)^2
  1   59    63.4        -4.4             19.0
  2   60    63.4        -3.4             11.3
  3   61    63.4        -2.4              5.6
  4   62    63.4        -1.4              1.8
  5   62    63.4        -1.4              1.8
  6   63    63.4        -0.4              0.1
  7   63    63.4        -0.4              0.1
  8   63    63.4        -0.4              0.1
  9   64    63.4         0.6              0.4
 10   64    63.4         0.6              0.4
 11   65    63.4         1.6              2.7
 12   66    63.4         2.6              7.0
 13   67    63.4         3.6             13.3
 14   68    63.4         4.6             21.6

Mean of x_i: 63.4; sum of deviations: 0.0; sum of squared deviations: 85.2

$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
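As one software option, here is a minimal Python sketch of the same calculation for the 14 women's heights. The variable names are only for this example; the steps follow the two-step recipe (variance first, then square root).

```python
from math import sqrt

# Women's heights in inches, from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                              # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)   # sum of squared deviations
variance = sum_sq_dev / (n - 1)                       # divide by degrees of freedom
s = sqrt(variance)                                    # sample standard deviation

print(round(x_bar, 1), round(sum_sq_dev, 1), round(variance, 2), round(s, 2))
# prints: 63.4 85.2 6.55 2.56
```

The standard library function `statistics.stdev(heights)` returns the same sample standard deviation in one call.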

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

value of $\sigma$ typically not known; use $s$ to estimate value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance, $\sigma^2$ is the population variance. Units are squared units of the original data (square dollars, square gallons).

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0; when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION

$\mu$   population mean
population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

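68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between y-s and y+s, 34% on each side of the mean y]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between y-2s and y+2s, 47.5% on each side of the mean y]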

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
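The three checks above can be sketched in Python as follows. This is a minimal illustration using the 50 textbook costs; the helper name `share_within` is made up for this example, and `statistics.stdev` computes the sample standard deviation (divisor n - 1).

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)


def share_within(k):
    """Count and percentage of values inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(1 for c in costs if low < c < high)
    return inside, 100 * inside / len(costs)


for k in (1, 2, 3):
    count, pct = share_within(k)
    print(f"within {k} sd: {count} of {len(costs)} = {pct:.0f}%")
```

The counts come out as 32, 48, and 50 of 50, matching the 64%, 96%, and 100% on the slides.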

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
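The z-score comparisons can be reproduced with a few lines of Python. This is a minimal sketch; the function name `z_score` is made up for this example.

```python
def z_score(y, mean, sd):
    """Number of standard deviations y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


exam1 = z_score(91, 88, 6)         # exam 1 score of 91
exam2 = z_score(92, 88, 10)        # exam 2 score of 92
eleanor = z_score(680, 500, 100)   # SAT Math
gerald = z_score(27, 18, 6)        # ACT Math

print(exam1, exam2)      # 0.5 0.4
print(eleanor, gerald)   # 1.8 1.5
```

The larger z-score marks the better relative performance: exam 1 over exam 2, and Eleanor over Gerald.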

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (positions 1 to 25): 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4 | 1/4 | 1/4 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
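The quartile rules above can be sketched in Python as follows. This is a minimal illustration of the rule stated on the slide (include the overall median in both halves when n is odd); the function names are made up for this example. Statistical software often uses other quartile conventions, so its answers can differ slightly.

```python
def median(sorted_vals):
    """Median of an already sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the halves rule from the slide."""
    vals = sorted(data)
    n = len(vals)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = vals[:half + 1], vals[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = vals[:half], vals[half:]
    return median(lower), median(vals), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

For an odd-sized example, `quartiles([1, 2, 3, 4, 5])` uses the halves 1 2 3 and 3 4 5 and gives Q1 = 2, median = 3, Q3 = 4.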

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

Data (positions 25 down to 1): 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: boxplot for "Disease X"; y-axis: Years until death (0 to 7); the five-number summary (min, Q1, m, Q3, max) is marked on the plot]

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (positions 25 down to 1): 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Largest = max = 7.9

Boxplot display of 5-number summary

[Figure: boxplot for "Disease X"; y-axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
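The outlier check can be sketched in Python as follows. This is a minimal illustration of the 1.5 × IQR rule with the quartiles taken from the slide; the function name `outlier_fences` and the variable names are made up for this example.

```python
def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Disease X data (years until death) with the largest value 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower, upper = outlier_fences(2.3, 4.2)   # Q1 and Q3 from the slide
outliers = [y for y in years if y < lower or y > upper]
whisker_top = max(y for y in years if y <= upper)

print(round(upper, 2), outliers, whisker_top)
```

Only 7.9 falls beyond the upper fence of 7.05, so the upper whisker stops at 6.1 and 7.9 is plotted as a separate point.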

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with marked values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks: 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
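The marginal distributions can be reproduced from the counts with a short Python sketch. The dictionary layout and variable names are only one possible way to hold the table and are assumptions of this example.

```python
# Titanic counts: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: 100 * sum(row.values()) / total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: 100 * sum(counts[status][cls] for status in counts) / total
    for cls in counts["Alive"]
}

for name, pct in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {pct:.1f}%")
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.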

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                            Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
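The three questions share the numerator 202 and differ only in the denominator. A minimal Python sketch, with variable names chosen only for this example:

```python
# Titanic counts by class for survivors and non-survivors.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

alive_total = sum(alive.values())                 # all survivors
grand_total = alive_total + sum(dead.values())    # everyone on board
first_total = alive["First"] + dead["First"]      # all first class passengers

# Same numerator, three different denominators.
share_of_survivors_in_first = alive["First"] / alive_total
share_first_and_survived = alive["First"] / grand_total
share_of_first_who_survived = alive["First"] / first_total

print(f"{100 * share_of_survivors_in_first:.1f}%")   # condition on survival
print(f"{100 * share_first_and_survived:.1f}%")      # joint percentage
print(f"{100 * share_of_first_who_survived:.1f}%")   # condition on class
```

The last value, about 62.2%, is the "% of col" entry for first class in the conditional distribution table.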

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

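Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$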
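The value of r for the car data can be checked with a short Python sketch that follows the formula term by term. The function name `correlation` is made up for this example, and the data are the ten (weight, fuel consumption) pairs listed with the scatterplot.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.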

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot; x-axis: Fouls (0 to 30); y-axis: Points (0 to 80)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 65: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
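As a rough illustration of this sensitivity, the sketch below compares the mean and median of a small data set before and after one very large value is added. The salary figures are hypothetical and chosen only for the example; they do not come from the slides.

```python
from statistics import mean, median

# Hypothetical salaries in $1000s (illustrative values, not from the slides).
salaries = [400, 450, 500, 550, 600]

# Add a single observation that is much larger than the rest.
with_star = salaries + [25000]

# Before: mean and median agree.
print(mean(salaries), median(salaries))

# After: the mean is pulled far upward, the median barely moves.
print(mean(with_star), median(with_star))
```

One extreme observation moves the mean by thousands while the median shifts only slightly, which is the disadvantage described above.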

Mean, Median, Maximum Baseball Salaries 1985 - 2014

[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2014. Horizontal axis: Year. Separate vertical axes for Mean/Median Salary and for Maximum Salary.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed): mean > median

[Figure: histogram, 2011 Baseball Salaries. Horizontal axis: Salary ($1000s). Vertical axis: Frequency.]

Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100). Vertical axis: Frequency.]

Symmetric data

Mean and median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am. Horizontal axis: Number of Customers. Vertical axis: Frequency.]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude, and sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$:

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing.

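Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$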
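The same check can be sketched in a few lines of Python. The helper name `deviations` is an illustrative choice, not something defined in the slides.

```python
def deviations(data):
    """Return each value's deviation from the mean of the data."""
    center = sum(data) / len(data)
    return [value - center for value in data]

set1 = [49, 51]   # data set 1: tightly clustered around 50
set2 = [0, 100]   # data set 2: widely spread around 50

# Both sums are zero even though the spreads are very different.
print(deviations(set1), sum(deviations(set1)))
print(deviations(set2), sum(deviations(set2)))
```

Because positive and negative deviations cancel, the plain sum cannot distinguish the two data sets; squaring the deviations removes the cancellation.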

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ (sample standard deviation)

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women height (inches)

 i    xi    x̄      (xi - x̄)   (xi - x̄)²
 1    59    63.4    -4.4       19.0
 2    60    63.4    -3.4       11.3
 3    61    63.4    -2.4        5.6
 4    62    63.4    -1.4        1.8
 5    62    63.4    -1.4        1.8
 6    63    63.4    -0.4        0.1
 7    63    63.4    -0.4        0.1
 8    63    63.4    -0.4        0.1
 9    64    63.4     0.6        0.4
10    64    63.4     0.6        0.4
11    65    63.4     1.6        2.7
12    66    63.4     2.6        7.0
13    67    63.4     3.6       13.3
14    68    63.4     4.6       21.6
Mean 63.4           Sum 0.0    Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation: $s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: the height data with the interval Mean ± 1 s.d. marked.]

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
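As one possible way to automate the calculation, here is a minimal Python sketch that repeats the steps of the women's height table and then compares the result with the standard library's `statistics.stdev`. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Heights (inches) of the 14 women in the table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean_height = sum(heights) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean_height) ** 2 for x in heights)

# Sample variance divides by the degrees of freedom, n - 1.
variance = sum_sq_dev / (n - 1)

# The standard deviation is the square root of the variance.
s = sqrt(variance)

print(round(mean_height, 1), round(sum_sq_dev, 1), round(variance, 2), round(s, 2))

# The library routine uses the same n - 1 divisor.
print(round(stdev(heights), 2))
```

The hand-style computation and `statistics.stdev` agree because both divide by n - 1.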

Population Standard Deviation

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ (population standard deviation)

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance, σ² is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0 (when does s = 0? σ = 0?)

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$   sample mean
$m$   sample median
$s^2$   sample variance
$s$   sample stand. dev.

POPULATION
$\mu$   population mean
$m$   population median
$\sigma^2$   population variance
$\sigma$   population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical); Histogram (graphical); 68-95-99.7 rule.]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between ȳ - s and ȳ + s (34% on each side of ȳ).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between ȳ - 2s and ȳ + 2s (47.5% on each side of ȳ).]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
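The three interval counts for the textbook costs can be reproduced with a short sketch. The function name `share_within` is an assumption made for this example; the data are the 50 costs listed in the slides.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

def share_within(data, k):
    """Fraction of values within k sample standard deviations of the mean."""
    center, spread = mean(data), stdev(data)
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for value in data if low <= value <= high)
    return inside / len(data)

# Compare the observed shares with the 68-95-99.7 benchmarks.
for k in (1, 2, 3):
    print(k, share_within(costs, k))
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 benchmarks, which is what one would expect for a roughly bell-shaped sample of 50 values.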

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

[Figure: dotplot of the men's weights.]

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
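The exam and SAT/ACT comparisons both apply the same standardization, which can be sketched as a one-line function. The name `z_score` and its parameter names are illustrative.

```python
def z_score(value, center, spread):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - center) / spread

# Exam example: same mean, different standard deviations.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)
print(exam1, exam2)

# SAT vs ACT example: different scales become comparable after standardizing.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)
print(eleanor, gerald)
```

Standardizing puts scores from different scales into common units (standard deviations from the mean), so the larger z-score marks the better relative performance.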

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
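A quick numerical check of this property, using the support figures from the table, might look like the following sketch. The variable names are illustrative, and the sum is compared with zero only up to floating-point rounding.

```python
from statistics import mean, stdev

# Support ($ millions) for the 9 public ACC schools in the table.
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

center = mean(support)
spread = stdev(support)

# Standardize every value with the sample mean and standard deviation.
z_scores = [(value - center) / spread for value in support]

print(round(center, 4), round(spread, 4))
print([round(z, 2) for z in z_scores])

# The z-scores sum to zero, apart from floating-point rounding.
print(abs(sum(z_scores)) < 1e-9)
```

The z-scores inherit the zero sum from the deviations: dividing every deviation by the same s does not change the fact that they cancel.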

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

Quartiles: measuring spread by examining the middle

Data (columns: observation number, position counted in from the ends of each half, value):

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: the sorted data split at Q1, M, and Q3 into four pieces, each holding 1/4 of the data.]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
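The quartile rule stated on this slide can be sketched as a small function. The name `quartiles` is illustrative, and the code follows the slide's convention (overall median included in both halves when n is odd), which is only one of several conventions in use.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

# The slide's example: 2, 4, ..., 20 (n = 10).
q1, m, q3 = quartiles(range(2, 21, 2))
print(q1, m, q3)
```

Calculators and statistical software may use other quartile conventions, so results can differ slightly from this rule on small data sets.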

Pulse Rates, n = 138

Count  Stem  Leaves
  3     4    588
  9     5    001233444
 10     5    5556788899
 23     6    00011111122233333344444
 23     6    55556666667777788888888
 16     7    00000112222334444
 23     7    55555666666777888888999
 10     8    0000112224
 10     8    5555667789
  4     9    0012
  2     9    58
  4    10    0223
  0    10
  1    11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
  2     22 | 55
  4     23 | 57
  6     24 | 26
  7     25 | 7
 10     26 | 257
 12     27 | 59
 (4)    28 | 1567
 15     29 | 35599
 10     30 | 333
  7     31 | 45
  5     32 | 155
  2     33 | 6
  1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

Data, from observation 25 down to observation 1: 6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

[Figure: boxplot for Disease X. Vertical axis: Years until death (0 to 7).]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

[Figure: boxplot of the ages.]

Boxplot: display of 5-number summary (data with an outlier)

Data, from observation 25 down to observation 1: 7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

[Figure: boxplot for Disease X with the outlier plotted separately. Vertical axis: Years until death (0 to 8).]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
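The 1.5 × IQR check for the Disease X data can be sketched as follows. The helper name `fences` is illustrative, and Q1 = 2.3 and Q3 = 4.2 are taken from the slide rather than recomputed.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Years until death for the 25 individuals, including the 7.9-year value.
years = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
         3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6]

low, high = fences(2.3, 4.2)

# Values outside the fences are flagged as outliers.
outliers = [value for value in years if value < low or value > high]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(value for value in years if value <= high)

print(round(high, 2), outliers, whisker_top)
```

Only the 7.9-year value falls beyond the upper fence of about 7.05, so the upper whisker ends at 6.1 and 7.9 is plotted on its own.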

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data with 40.5, 45, 63, 70, 78, and 100.5 marked.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers. Axis from 0 to 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
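The marginal distributions can be computed directly from the table of counts. The sketch below stores the Titanic counts in nested dictionaries; the variable names are illustrative.

```python
# Counts from the Titanic table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

# Grand total of all cells.
total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row.values()) / total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: sum(row[cls] for row in counts.values()) / total
    for cls in counts["Alive"]
}

print(total)
print({k: round(100 * v, 1) for k, v in survival_marginal.items()})
print({k: round(100 * v, 1) for k, v in class_marginal.items()})
```

Each marginal distribution uses only the totals in the margins of the table and ignores the other variable.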

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival               Crew    First   Second   Third   Total
Alive   Count           212      202      118     178     710
        % of col       24.0%   62.2%    41.4%   25.2%   32.3%
Dead    Count           673      123      167     528    1491
        % of col       76.0%   37.8%    58.6%   74.8%   67.7%
Total   Count           885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the same Titanic table of counts and column percentages):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
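The three questions use the same count (202 first-class survivors) with three different denominators. A minimal sketch, with illustrative variable names:

```python
# Titanic counts by class, split by survival.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())                  # everyone who survived
passengers = survivors + sum(dead.values())      # everyone on board
first_class = alive["First"] + dead["First"]     # everyone in first class

# Fraction of survivors who were in first class (condition on survival).
frac_of_survivors = alive["First"] / survivors

# Fraction of all people who were first class AND survivors (joint).
frac_joint = alive["First"] / passengers

# Fraction of first-class passengers who survived (condition on class).
frac_of_first = alive["First"] / first_class

print(round(frac_of_survivors, 3), round(frac_joint, 3), round(frac_of_first, 3))
```

The wording of each question fixes the denominator: "of survivors" divides by 710, "of passengers" by 2201, and "of the first class passengers" by 325.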

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol (BAC)
 1        5       0.10
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

[Figure: scatterplot of BAC vs number of beers for the 16 students in the table.]

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs). Vertical axis: FUEL CONSUMP. (gal/100 miles).]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

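Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT. Horizontal axis: WEIGHT (1000 lbs). Vertical axis: FUEL CONSUMP. (gal/100 miles).]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$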
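The correlation formula can be applied to the ten (weight, fuel consumption) pairs from the scatterplot slide. This is a minimal sketch: the function name `correlation` is illustrative, and the code multiplies standardized x and y values and divides the sum by n - 1, as in the formula.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return total / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Because r is built from z-scores, it has no units and does not change if weight or fuel consumption is rescaled to different units.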

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30).]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Mean Median Maximum Baseball Salaries 1985 - 201419

85

1987

1989

1991

1993

1995

1997

1999

2001

2003

2005

2007

2009

2011

2013

200000

700000

1200000

1700000

2200000

2700000

3200000

3700000

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

Baseball Salaries Mean Median and Maximum 1985-2014

Mean Median Maximum

Year

Mea

n M

edia

n S

alar

y

Max

imu

m S

alar

y

Skewness comparing the mean and median

Skewed to the right (positively skewed) meangtmedian

53

490

102 7235 21 26 17 8 10 2 3 1 0 0 1

0

100

200

300

400

500

600

Freq

uenc

y

Salary ($1000s)

2011 Baseball Salaries

Skewed to the left negatively skewed

Mean lt median mean=78 median=87

Histogram of Exam Scores

0

10

20

30

20 30 40 50 60 70 80 90 100Exam Scores

Fre

qu

en

cy

Symmetric data

mean median approx equal

Bank Customers 1000-1100 am

0

5

10

15

20

Number of Customers

Fre

qu

en

cy

Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1 range=largest-smallest

ok sometimes in general too crude sensitive to one large or small obs

1

2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

Denoted by the lowercase Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$    sample mean
$m$    sample median
$s^2$    sample variance
$s$    sample stand. dev.

POPULATION

$\mu$    population mean
population median
$\sigma^2$    population variance
$\sigma$    population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

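68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]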

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

For the same 50 textbook costs, $\bar{y} = 375.48$ and $s = 42.72$.

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
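The three interval counts for the textbook costs can be reproduced with a short Python sketch. This is an illustration rather than the slides' own procedure; `share_within` is a hypothetical helper name, and the mean and standard deviation are recomputed from the 50 listed values instead of being typed in.

```python
import statistics

costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

mean = statistics.mean(costs)  # 375.48
s = statistics.stdev(costs)    # about 42.72


def share_within(values, center, spread, k):
    """Fraction of values lying within k standard deviations of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low <= v <= high) / len(values)


for k in (1, 2, 3):
    print(k, share_within(costs, mean, s, k))
```

The observed shares (64%, 96%, 100%) sit close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.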

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
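The exam and SAT/ACT comparisons both come down to one formula, so a small helper covers them. This is a minimal Python sketch; the function name `z_score` is an illustrative choice, not something the slides define.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Exam comparison: same mean, different spreads
exam_1 = z_score(91, 88, 6)    # 0.5
exam_2 = z_score(92, 88, 10)   # 0.4

# SAT vs ACT: different scales made comparable
eleanor = z_score(680, 500, 100)  # 1.8
gerald = z_score(27, 18, 6)       # 1.5

print(exam_1, exam_2, eleanor, gerald)
```

Because z-scores are unit-free, the larger z identifies the better relative performance even when the raw scores are on different scales.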

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School        Support    y - ybar    Z-score
Maryland       15.5        6.4        1.79
UVA            13.1        4.0        1.12
Louisville     10.9        1.8        0.50
UNC             9.2        0.1        0.03
VaTech          7.9       -1.2       -0.34
FSU             7.9       -1.2       -0.34
GaTech          7.1       -2.0       -0.56
NCSU            6.5       -2.6       -0.73
Clemson         3.8       -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
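The claim that z-scores add to zero can be checked numerically with the nine support figures. This is a hedged Python sketch with illustrative variable names; because floating-point arithmetic is inexact, the sum is compared against zero with a small tolerance rather than tested for exact equality.

```python
import statistics

# Support ($ millions) for the 9 public ACC schools, in table order
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

mean = statistics.mean(support)
s = statistics.stdev(support)

z_scores = [(y - mean) / s for y in support]

print(round(mean, 4), round(s, 4))
print(abs(sum(z_scores)) < 1e-9)  # True: the z-scores cancel out
```

The cancellation follows directly from the fact that deviations from the mean always sum to zero; dividing each deviation by the same $s$ does not change that.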

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Columns: rank in the sorted data, counting position used to locate the quartiles, data value

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
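The quartile rules above translate directly into code. The following Python sketch implements this particular convention (median of each half, with the overall median included in both halves when n is odd); the function name `quartiles` is illustrative. Calculators and software packages often use other quartile conventions, so their answers can differ slightly from this rule.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = xs[:half + 1], xs[half:]
    else:
        # n even: the overall median is in neither half
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```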

Pulse Rates, n = 138

Count   Stem   Leaves
         4
  3      4     588
  9      5     001233444
 10      5     5556788899
 23      6     00011111122233333344444
 23      6     55556666667777788888888
 16      7     00000112222334444
 23      7     55555666666777888888999
 10      8     0000112224
 10      8     5555667789
  4      9     0012
  2      9     58
  4     10     0223
        10
  1     11     1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Stem-and-leaf display (left column is the depth):

  2    22 | 55
  4    23 | 57
  6    24 | 26
  7    25 | 7
 10    26 | 257
 12    27 | 59
 (4)   28 | 1567
 15    29 | 35599
 10    30 | 333
  7    31 | 45
  5    32 | 155
  2    33 | 6
  1    34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Stem-and-leaf display (left column is the depth):

  2    22 | 55
  4    23 | 57
  6    24 | 26
  7    25 | 7
 10    26 | 257
 12    27 | 59
 (4)   28 | 1567
 15    29 | 35599
 10    30 | 333
  7    31 | 45
  5    32 | 155
  2    33 | 6
  1    34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111
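A five-number summary can be computed by combining the minimum and maximum with the median-of-halves quartile rule from the quartile slides. The Python sketch below is illustrative (the function name is hypothetical) and uses the 25 Disease X values from the quartile slide, since the raw pulse data are only available as a stem-and-leaf display.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # When n is odd the overall median is included in both halves
    lower = xs[:half + 1] if n % 2 == 1 else xs[:half]
    upper = xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


# Disease X: years until death for 25 individuals
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

print(five_number_summary(years))  # (0.6, 2.3, 3.4, 4.2, 6.1)
```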

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data, rank 25 down to rank 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data, rank 25 down to rank 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: Years until death, 0 to 8; the value 7.9 is plotted separately as an outlier]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

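Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates marking min 45, lower fence 40.5, Q1 63, median 70, Q3 78, and upper fence 100.5]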
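The 1.5 × IQR calculation used for the boxplot can be sketched in Python as follows. The helper name `outlier_fences` and the term "fences" are labels chosen for this illustration; the slides simply compute Q1 - 1.5(IQR) and Q3 + 1.5(IQR).

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Beginning-of-class pulse rates: Q1 = 63, Q3 = 78
low, high = outlier_fences(63, 78)
print(low, high)  # 40.5 100.5

# Five-number summary of the pulse data: 45, 63, 70, 78, 111
print(45 < low)    # False: the minimum is not an outlier
print(111 > high)  # True: the maximum lies beyond the upper cutoff
```

Values beyond either cutoff are drawn as individual points, and each whisker stops at the most extreme data value that is still inside the cutoffs.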

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew    First   Second   Third   Total
Alive      212     202      118     178     710
Dead       673     123      167     528    1491
Total      885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
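The marginal distributions are just row and column totals divided by the grand total. The following Python sketch shows one way to compute them from the Titanic counts; the nested-dictionary layout and variable names are assumptions made for the example.

```python
# Titanic counts: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())  # 2201

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
class_marginal = {
    cls: sum(counts[status][cls] for status in counts) / grand_total
    for cls in counts["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(name, round(100 * share, 1))
```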

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Crew     First    Second   Third    Total
Alive   Count           212      202      118      178      710
        % of column    24.0%    62.2%    41.4%    25.2%    32.3%
Dead    Count           673      123      167      528     1491
        % of column    76.0%    37.8%    58.6%    74.8%    67.7%
Total   Count           885      325      285      706     2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the same table of counts and column percentages):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
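The three questions share a numerator (the 202 first-class survivors) and differ only in the denominator. A short Python sketch, with illustrative variable names, makes the distinction concrete:

```python
first_class_survivors = 202
all_survivors = 710      # row total: Alive
first_class_total = 325  # column total: First
all_on_board = 2201      # grand total

# Conditional on surviving: share of survivors who were in first class
print(round(first_class_survivors / all_survivors, 3))      # 0.285

# Joint: share of everyone who was both first class and a survivor
print(round(first_class_survivors / all_on_board, 3))       # 0.092

# Conditional on class: share of first-class passengers who survived
print(round(first_class_survivors / first_class_total, 3))  # 0.622
```

The last figure matches the 62.2% column percentage for first class in the conditional distribution table.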

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

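Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$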
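The correlation formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. This Python sketch follows the formula term by term: each value is standardized with its own mean and sample standard deviation, the products are summed, and the total is divided by n - 1. The function and variable names are illustrative.

```python
import statistics


def correlation(xs, ys):
    """Correlation coefficient r as the averaged product of z-scores (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

print(round(correlation(weight, fuel), 4))  # 0.9766
```

The result reproduces the r = .9766 shown for this data set, a strong positive linear relationship.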

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3


2 where

the middle is the mean

deviation of from the mean

( ) sum the deviations of all the s from

measure spread from the middle

i i

n

i ii

y

y y y

y y y y

1

( ) 0 always tells us nothingn

ii

y y

Example

1 2

1 2

1 2

1 2

sum of deviations from mean

49 51 50

( ) ( ) (49 50) (51 50) 1 1 0

0 100

Data set 1

Data set 2 50

( ) ( ) (0 50) (100 50) 50 50 0

x x x

x x x x

y y y

y y y y

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

stem | leaf
 2   22 | 55
 4   23 | 57
 6   24 | 26
 7   25 | 7
10   26 | 257
12   27 | 59
(4)  28 | 1567
15   29 | 35599
10   30 | 333
 7   31 | 45
 5   32 | 155
 2   33 | 6
 1   34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

stem | leaf
 2   22 | 55
 4   23 | 57
 6   24 | 26
 7   25 | 7
10   26 | 257
12   27 | 59
(4)  28 | 1567
15   29 | 35599
10   30 | 333
 7   31 | 45
 5   32 | 155
 2   33 | 6
 1   34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25 1 6.1
24 2 5.6
23 3 5.3
22 4 4.9
21 5 4.7
20 6 4.5
19 7 4.2
18 6 4.1
17 5 3.9
16 4 3.8
15 3 3.7
14 2 3.6
13 1 3.4
12 2 3.3
11 3 2.9
10 4 2.8
9 5 2.5
8 6 2.3
7 7 2.3
6 6 2.1
5 5 1.5
4 4 1.9
3 3 1.6
2 2 1.2
1 1 0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: Disease X data, years until death (axis from 0 to 7)]

Five-number summary

min Q1 m Q3 max
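As a rough check of the five-number summary for the Disease X data, the sketch below applies the same median-of-halves rule. It assumes the 25 values are in years with one decimal place (0.6 through 6.1), which is how the axis labels suggest they should be read; `five_number_summary` and `disease_x` are names made up for this illustration.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    # When n is odd, the overall median is included in both halves.
    lower = ys[:half + (n % 2)]
    upper = ys[half:]
    return ys[0], median(lower), median(ys), median(upper), ys[-1]

# Assumed values (years until death), decimals restored
disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

print(five_number_summary(disease_x))
```

Under these assumptions the result agrees with the slide: min 0.6, Q1 2.3, median 3.4, Q3 4.2, max 6.1.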

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25 1 7.9
24 2 6.1
23 3 5.3
22 4 4.9
21 5 4.7
20 6 4.5
19 7 4.2
18 6 4.1
17 5 3.9
16 4 3.8
15 3 3.7
14 2 3.6
13 1 3.4
12 2 3.3
11 3 2.9
10 4 2.8
9 5 2.5
8 6 2.3
7 7 2.3
6 6 2.1
5 5 1.5
4 4 1.9
3 3 1.6
2 2 1.2
1 1 0.6

Largest = max = 7.9

Boxplot display of 5-number summary

BOXPLOT

[Figure: boxplot of the Disease X data, years until death (axis from 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual #25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
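The 1.5 × IQR rule used above can be sketched as follows. The helper names `fences` and `split_outliers` are hypothetical, and the data list assumes the Disease X values are in years with one decimal place, with individual 25 at 7.9.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(data, q1, q3):
    """Separate values inside the fences from suspected outliers."""
    lower, upper = fences(q1, q3)
    inside = [y for y in data if lower <= y <= upper]
    outliers = [y for y in data if y < lower or y > upper]
    return inside, outliers

# Assumed Disease X values (years), with the largest observation equal to 7.9
disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

inside, outliers = split_outliers(disease_x, q1=2.3, q3=4.2)
whisker_top = max(inside)   # where the upper whisker of the boxplot ends
print(outliers, whisker_top)

# Pulse-rate example: Q1 = 63, Q3 = 78
print(fences(63, 78))
```

With these inputs the only flagged value is 7.9 and the upper whisker stops at 6.1; for the pulse data the fences come out at 40.5 and 100.5.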

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, labeled at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks from 0 to 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
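The marginal distributions above are just row totals and column totals divided by the grand total. A minimal sketch, using the Titanic counts from the table (the dictionary layout and variable names are choices made for this illustration):

```python
# Counts from the Titanic table: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
classes = list(counts["Alive"])
class_marginal = {
    c: sum(counts[status][c] for status in counts) / grand_total for c in classes
}

for name, p in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * p:.1f}%")
```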

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival             Class:  Crew    First   Second   Third   Total
Alive   Count                 212      202      118     178     710
        % of col            24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count                 673      123      167     528    1491
        % of col            76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count                 885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (same survival-by-class table as above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
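The three questions differ only in the denominator. The sketch below, which reuses the Titanic counts and made-up variable names, computes the conditional distribution of survival given class and then the three fractions.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

class_totals = {c: alive[c] + dead[c] for c in alive}
survivors = sum(alive.values())          # 710
passengers = sum(class_totals.values())  # 2201

# Conditional distribution: chance of survival given the class (% of column)
survival_given_class = {c: alive[c] / class_totals[c] for c in alive}

# Same numerator (202), three different denominators
first_among_survivors = alive["First"] / survivors              # 202/710
first_and_survived = alive["First"] / passengers                # 202/2201
survived_among_first = alive["First"] / class_totals["First"]   # 202/325

for c, p in survival_given_class.items():
    print(f"P(alive | {c}) = {100 * p:.1f}%")
print(first_among_survivors, first_and_survived, survived_among_first)
```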

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), y-axis: FUEL CONSUMP. (gal/100 miles)]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

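Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), y-axis: FUEL CONSUMP. (gal/100 miles)]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$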
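The formula for r can be evaluated directly as an average of products of standardized values. The sketch below is an illustration only: the function name `correlation` is made up, and the data are the ten (weight, fuel consumption) pairs read with one decimal place (weights in 1000 lbs, consumption in gal/100 miles), which is an assumption about how the slide's numbers should be read.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Under that reading of the data, the computed value agrees with the r = .9766 reported on the slide.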

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs. Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3


Skewed to the left (negatively skewed)

Mean < median: mean = 78, median = 87

[Figure: Histogram of Exam Scores; x-axis: Exam Scores (20 to 100), y-axis: Frequency]

Symmetric data

mean and median approximately equal

[Figure: histogram, Bank Customers 10:00-11:00 am; x-axis: Number of Customers, y-axis: Frequency]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small observation.

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$: deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$: sum the deviations of all the $y_i$'s from $\bar{y}$

$$\sum_{i=1}^{n} (y_i - \bar{y}) = 0 \ \text{always; tells us nothing}$$

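Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$$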

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \quad \text{is called the sample variance}$$

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = √6.55 = 2.56 inches

Women's height (inches)

i     x_i    x̄       (x_i - x̄)    (x_i - x̄)²
1     59     63.4    -4.4         19.0
2     60     63.4    -3.4         11.3
3     61     63.4    -2.4          5.6
4     62     63.4    -1.4          1.8
5     62     63.4    -1.4          1.8
6     63     63.4    -0.4          0.1
7     63     63.4    -0.4          0.1
8     63     63.4    -0.4          0.1
9     64     63.4     0.6          0.4
10    64     63.4     0.6          0.4
11    65     63.4     1.6          2.7
12    66     63.4     2.6          7.0
13    67     63.4     3.6         13.3
14    68     63.4     4.6         21.6

Mean 63.4; Sum of deviations 0.0; Sum of squared deviations 85.2

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.

2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
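As one example of "other software", the sketch below reproduces the women's-height calculation in Python, first step by step and then with the standard library's `statistics.stdev`. The variable names are made up for this illustration; any statistics package would do the same job.

```python
from math import sqrt
from statistics import stdev

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
mean_height = sum(heights) / n

# Sum of squared deviations from the mean
sum_sq_dev = sum((x - mean_height) ** 2 for x in heights)

variance = sum_sq_dev / (n - 1)   # divide by the degrees of freedom, n - 1
s = sqrt(variance)

print(round(mean_height, 1), round(sum_sq_dev, 1), round(variance, 2), round(s, 2))
print(round(stdev(heights), 2))   # library routine, same n - 1 definition
```

The rounded results match the slide's 63.4, 85.2, and 2.56 (the variance prints as 6.56 because the slide divides the already rounded 85.2 by 13).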

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

Denoted by the lower case Greek letter $\sigma$.

$N$ is the population size (for example, $N$ = 34,000 for NCSU).

$\mu$ is the population mean.

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \ \bar{y} + s) = (332.76, \ 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \ \bar{y} + 2s) = (290.04, \ 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \ \bar{y} + 3s) = (247.32, \ 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
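The comparison with the 68-95-99.7 rule can be automated by counting how many observations fall within k standard deviations of the mean. A minimal sketch for the 50 textbook costs (the helper name `count_within` is hypothetical):

```python
from statistics import mean, stdev

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

def count_within(data, k):
    """Count the values strictly inside (mean - k*s, mean + k*s)."""
    y_bar, s = mean(data), stdev(data)
    low, high = y_bar - k * s, y_bar + k * s
    return sum(1 for y in data if low < y < high)

for k in (1, 2, 3):
    inside = count_within(costs, k)
    print(k, inside, f"{100 * inside / len(costs):.0f}%")
```

For these data the counts are 32, 48, and 50 out of 50, close to the 68%, 95%, and 99.7% the rule predicts for bell-shaped data.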

The best estimate of the standard deviation of the men's weights displayed in this dotplot is:

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
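Comparisons like these reduce to one small function. The sketch below, with a hypothetical helper `z_score`, reproduces the exam and SAT/ACT comparisons.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - y_bar) / s

# Exam comparison: same mean, different spreads
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)
print(exam1, exam2, "exam 1 is better" if exam1 > exam2 else "exam 2 is better")

# SAT vs ACT: different scales, so compare z-scores rather than raw scores
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)
print(eleanor, gerald)
```

The larger z-score marks the score that stands out more relative to its own distribution, which is why 91 beats 92 here.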

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of z-scores = 0
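That the z-scores of a whole data set add to zero follows from the deviations adding to zero. A quick numerical check, assuming the support figures are in $ millions with one decimal place as in the table (variable names are made up for this illustration):

```python
from statistics import mean, stdev

support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

y_bar = mean(support.values())
s = stdev(support.values())

# Standardize every school's value
z = {school: (y - y_bar) / s for school, y in support.items()}

for school, score in z.items():
    print(f"{school:<11}{score:6.2f}")

# Only floating-point rounding error remains in the sum
print(abs(sum(z.values())) < 1e-9)
```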

Recently the mean tuition at 4-yr public colleges/universities in the U.S. was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team What is the

value of the first quartile Q1

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 287

2 2575

3 2635

4 2625

Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3= third quartile = 42

Q1= first quartile = 23

25 1 7924 2 6123 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 79

Boxplot display of 5-number summary

BOXPLOT

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

8

Interquartile range

Q3 ndash Q1=42 minus 23 =

19

Q3+15IQR=42+285 = 705

15 IQR = 1519=285 Individual 25 has a value of

79 years so 79 is an outlier The line from the top

end of the box is drawn to the biggest number in the

data that is less than 705

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138) Q1 = 63 Q3 = 78 IQR=78 63=15

15(IQR)=15(15)=225

Q1 - 15(IQR) 63 ndash 225=405

Q3 + 15(IQR) 78 + 225=1005

7063 78405 100545

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who

gained at least 50 yards What is the approximate value of Q3

0 136273

410547

684821

9581095

12321369

Pass Catching Yards by Receivers

1 450

2 750

3 215

4 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol


Symmetric data

mean, median approximately equal

[Figure: histogram "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

Section 3.3 Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together: 68-95-99.7 Rule (Empirical Rule)

Recall: 2 characteristics of a data set to measure

center: measures where the "middle" of the data is located

variability: measures how "spread out" the data is

Ways to measure variability

1. range = largest - smallest
OK sometimes; in general, too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean \bar{y}
y_i - \bar{y} = deviation of y_i from the mean
\sum_{i=1}^{n} (y_i - \bar{y}) = sum of the deviations of all the y_i's from \bar{y}

\sum_{i=1}^{n} (y_i - \bar{y}) = 0 always; tells us nothing

Example: sum of deviations from mean

Data set 1: x_1 = 49, x_2 = 51, \bar{x} = 50
(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0

Data set 2: y_1 = 0, y_2 = 100, \bar{y} = 50
(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

s = \sqrt{ \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} }   (sample standard deviation)

s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}   is called the sample variance

Calculations ...

Women height (inches)

i     x_i   \bar{x}   (x_i - \bar{x})   (x_i - \bar{x})^2
1     59    63.4      -4.4              19.0
2     60    63.4      -3.4              11.3
3     61    63.4      -2.4              5.6
4     62    63.4      -1.4              1.8
5     62    63.4      -1.4              1.8
6     63    63.4      -0.4              0.1
7     63    63.4      -0.4              0.1
8     63    63.4      -0.4              0.1
9     64    63.4      0.6               0.4
10    64    63.4      0.6               0.4
11    65    63.4      1.6               2.7
12    66    63.4      2.6               7.0
13    67    63.4      3.6               13.3
14    68    63.4      4.6               21.6
Mean  63.4            Sum 0.0           Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = \sqrt{6.55} = 2.56 inches

1. First calculate the variance s^2:
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2

2. Then take the square root to get the standard deviation s:
s = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }

[Figure: mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
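As one possible way to automate the calculation above, here is a minimal Python sketch. It assumes the 14 heights from the table (the variable names are illustrative, not from the slides) and follows the same two steps: variance first, then the square root. The standard library's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

# Women's heights (inches) from the table above
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Step 1: variance = sum of squared deviations divided by (n - 1) degrees of freedom
squared_devs = [(x - mean) ** 2 for x in heights]
variance = sum(squared_devs) / (n - 1)

# Step 2: standard deviation = square root of the variance
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
# prints: mean = 63.4, s^2 = 6.55, s = 2.56

# Cross-check against the standard library (also divides by n - 1)
assert math.isclose(variance, statistics.variance(heights))
```

Dividing by n - 1 rather than n is what makes this the sample (not population) variance.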

Population Standard Deviation

\sigma = \sqrt{ \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N} }   (population standard deviation)

Denoted by the lower case Greek letter \sigma

N is the population size (for example, N = 34000 for NCSU)

\mu is the population mean

value of \sigma typically not known; use s to estimate value of \sigma

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that s and \sigma are always greater than or equal to zero.

3. The larger the value of s (or \sigma), the greater the spread of the data.

When does s = 0? When does \sigma = 0?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: s^2 sample variance, \sigma^2 population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and \sigma

s and \sigma are always greater than or equal to 0 (when does s = 0? \sigma = 0?)

The larger the value of s (or \sigma), the greater the spread of the data

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE
\bar{y}    sample mean
m          sample median
s^2        sample variance
s          sample stand. dev.

POPULATION
\mu        population mean
           population median
\sigma^2   population variance
\sigma     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in (\bar{y} - s, \bar{y} + s)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in (\bar{y} - 2s, \bar{y} + 2s)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in (\bar{y} - 3s, \bar{y} + 3s)

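68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between \bar{y} - s and \bar{y} + s, 34% on each side of \bar{y}]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between \bar{y} - 2s and \bar{y} + 2s, 47.5% on each side of \bar{y}]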

Example: textbook costs

\bar{y} = 375.48, s = 42.72, n = 50

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean:
(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean:
(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
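The comparison in the textbook-cost example can be checked with a short script. The sketch below is illustrative only: it assumes the 50 costs listed above, uses Python's `statistics` module for the mean and sample standard deviation, and counts how many values fall within k standard deviations of the mean. The helper name `count_within` is hypothetical.

```python
import statistics

# The 50 textbook costs from the example above
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

mean = statistics.mean(costs)
s = statistics.stdev(costs)  # sample standard deviation (divides by n - 1)


def count_within(k):
    """Count the values inside (mean - k*s, mean + k*s)."""
    low, high = mean - k * s, mean + k * s
    return sum(1 for c in costs if low <= c <= high)


for k, rule in [(1, 68), (2, 95), (3, 99.7)]:
    pct = 100 * count_within(k) / len(costs)
    print(f"within {k} s.d.: {pct:.0f}% of the data (rule says about {rule}%)")
```

The observed percentages only approximate the rule because the histogram of these 50 values is only roughly bell-shaped.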

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

z = \frac{y - \bar{y}}{s}

where
y = original data value
\bar{y} = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

Exam 1: \bar{y}_1 = 88, s_1 = 6, your exam 1 score: 91

Exam 2: \bar{y}_2 = 88, s_2 = 10, your exam 2 score: 92

Which score is better?

z_1 = (91 - 88)/6 = 3/6 = 0.5

z_2 = (92 - 88)/10 = 4/10 = 0.4

91 on exam 1 is better than 92 on exam 2

If data has mean \bar{y} and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean \bar{y}.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better
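The two comparisons above reduce to one small function. A minimal sketch in Python, using the exam and SAT/ACT numbers from the slides; the function and variable names are illustrative assumptions.

```python
def z_score(value, mean, sd):
    """Distance of a value from the mean, in units of the standard deviation."""
    return (value - mean) / sd


# Exam comparison: same mean, different spread
z_exam1 = z_score(91, 88, 6)    # 0.5
z_exam2 = z_score(92, 88, 10)   # 0.4

# SAT vs ACT comparison: different scales altogether
z_eleanor = z_score(680, 500, 100)  # 1.8
z_gerald = z_score(27, 18, 6)       # 1.5

print("Exam 1 is the better score" if z_exam1 > z_exam2 else "Exam 2 is the better score")
print("Eleanor scored better" if z_eleanor > z_gerald else "Gerald scored better")
```

Because a z-score is unit-free, it allows values measured on different scales to be compared directly.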

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - \bar{y}   Z-score
Maryland     15.5      6.4           1.79
UVA          13.1      4.0           1.12
Louisville   10.9      1.8           0.50
UNC          9.2       0.1           0.03
VaTech       7.9       -1.2          -0.34
FSU          7.9       -1.2          -0.34
GaTech       7.1       -2.0          -0.56
NCSU         6.5       -2.6          -0.73
Clemson      3.8       -5.3          -1.47
                       Sum = 0       Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

Ordered data (n = 25):
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 = first quartile = 2.3 (position 7)

m = median = 3.4 (position 13)

Q3 = third quartile = 4.2 (position 19)

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

       Q1      M      Q3
  1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10
Q1 = 6

Q3: median of upper half 12 14 16 18 20
Q3 = 16
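The quartile rules above can be sketched in a few lines of Python. This is an illustrative implementation of the slide's convention only (other software often uses slightly different quartile definitions, so results may differ); the function names are assumptions.

```python
def median(sorted_vals):
    """Median of an already sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule.

    Odd n: the overall median is included in both halves.
    Even n: the data split cleanly and the median is in neither half.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(data), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
# prints: (6, 11.0, 16)
```

For odd n, for example n = 31, each half holds 16 values and both contain the overall median, matching the "Important" note above.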

Pulse Rates n = 138

Count  Stem  Leaves
        4
3       4    588
9       5    001233444
10      5    5556788899
23      6    00011111122233333344444
23      6    55556666667777788888888
16      7    00000112222334444
23      7    55555666666777888888999
10      8    0000112224
10      8    5555667789
4       9    0012
2       9    58
4      10    0223
       10
1      11    1

Median: mean of pulses in locations 69 & 70: median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data
45  63  70  78  111

[Figure: Disease X, years until death, 25 ordered values from 0.6 to 6.1, annotated with the five-number summary (min, Q1, m, Q3, max): smallest = min = 0.6, Q1 = first quartile = 2.3, m = median = 3.4, Q3 = third quartile = 4.2, largest = max = 6.1]

Boxplot display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis: Years until death (0 to 7)]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13  17  19  22  47

Boxplot display of 5-number summary

Ordered data (n = 25):
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

[Figure: BOXPLOT of Disease X; vertical axis: Years until death (0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
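The 1.5 IQR check above can be written as a short helper. The sketch below is an illustration under stated assumptions: it uses the Disease X values with decimal points restored (they are lost in this transcript), takes Q1 = 2.3 and Q3 = 4.2 from the slide rather than recomputing them, and the function name `fences` is hypothetical.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Years until death, Disease X (25 individuals, largest value 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower, upper = fences(2.3, 4.2)          # upper is about 7.05
outliers = [y for y in years if y < lower or y > upper]
whisker_top = max(y for y in years if y <= upper)

print(f"upper fence = {upper:.2f}, outliers = {outliers}, whisker ends at {whisker_top}")
```

The same helper gives 40.5 and 100.5 for the pulse data (Q1 = 63, Q3 = 78).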

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates, marked at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis from 0 to 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178     710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival              Crew    First   Second   Third   Total
Alive   Count         212     202     118      178     710
        % of col.     24.0%   62.2%   41.4%    25.2%   32.3%
Dead    Count         673     123     167      528     1491
        % of col.     76.0%   37.8%   58.6%    74.8%   67.7%
Total   Count         885     325     285      706     2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (each refers to the Titanic table of counts and column percentages):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
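The marginal and conditional calculations for the Titanic table can be reproduced with a short script. A minimal sketch, assuming the counts from the table above stored in a plain nested dictionary; all variable names are illustrative.

```python
# Titanic counts: survival status by class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())            # 2201
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}

# Marginal distributions: row or column totals divided by the grand total
marginal_survival = {s: sum(row.values()) / total for s, row in counts.items()}
marginal_class = {c: class_totals[c] / total for c in classes}

# Conditional distribution of survival given class: divide by the column total
survival_given_class = {c: counts["Alive"][c] / class_totals[c] for c in classes}

# The three questions about first class differ only in the denominator
first_among_survivors = counts["Alive"]["First"] / sum(counts["Alive"].values())  # 202/710
first_and_survivor = counts["Alive"]["First"] / total                             # 202/2201
survived_given_first = survival_given_class["First"]                              # 202/325

for c in classes:
    print(f"{c}: {100 * marginal_class[c]:.1f}% of all aboard, "
          f"{100 * survival_given_class[c]:.1f}% survived")
```

Keeping track of which total is in the denominator is the whole difference between a marginal, a joint, and a conditional percentage.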

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol (BAC)
1         5       0.10
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

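Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)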
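The formula for r can be applied directly to the fuel consumption data. The sketch below is illustrative: it assumes the ten (weight, fuel consumption) pairs with decimal points restored to match the plot's axis scale (weight in 1000 lbs, fuel in gal/100 miles), and the function name `correlation` is hypothetical.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))
# prints: 0.9766
```

Each term in the sum is a product of two z-scores, which is why r has no units and does not change when either variable is rescaled.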

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3


Section 33Describing Variability of Data

Standard Deviation

Using the Mean and Standard Deviation Together 68-95-997

Rule (Empirical Rule)

Recall 2 characteristics of a data set to measure

center

measures where the ldquomiddlerdquo of the data is located

variability

measures how ldquospread outrdquo the data is

Ways to measure variability

1. range = largest - smallest

OK sometimes; in general too crude; sensitive to one large or small observation

2. measure spread from the middle, where the middle is the mean $\bar{y}$

$y_i - \bar{y}$ = deviation of $y_i$ from the mean

$\sum_{i=1}^{n} (y_i - \bar{y})$ = sum of the deviations of all the $y_i$'s from $\bar{y}$

$\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always; tells us nothing

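Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$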
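The point of the example, that deviations from the mean always cancel, can be checked with a few lines of Python. This is only an illustrative sketch; the function name sum_of_deviations is made up for this example.

```python
def sum_of_deviations(values):
    """Return the sum of (value - mean) over all values."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)

data_set_1 = [49, 51]    # tightly clustered around 50
data_set_2 = [0, 100]    # widely spread around 50

# Both sums are zero even though the spreads are very different,
# so the plain sum of deviations cannot measure variability.
print(sum_of_deviations(data_set_1))
print(sum_of_deviations(data_set_2))
```

Because the positive and negative deviations cancel, the deviations are squared before averaging in the standard deviation.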

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ is the sample standard deviation

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women's height (inches)

  i    x_i    \bar{x}    x_i - \bar{x}    (x_i - \bar{x})^2
  1    59     63.4       -4.4             19.0
  2    60     63.4       -3.4             11.3
  3    61     63.4       -2.4              5.6
  4    62     63.4       -1.4              1.8
  5    62     63.4       -1.4              1.8
  6    63     63.4       -0.4              0.1
  7    63     63.4       -0.4              0.1
  8    63     63.4       -0.4              0.1
  9    64     63.4        0.6              0.4
 10    64     63.4        0.6              0.4
 11    65     63.4        1.6              2.7
 12    66     63.4        2.6              7.0
 13    67     63.4        3.6             13.3
 14    68     63.4        4.6             21.6
       Mean 63.4          Sum 0.0         Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called degrees of freedom (df)

s^2 = variance = 85.2/13 = 6.55 square inches

s = standard deviation = sqrt(6.55) = 2.56 inches

1. First calculate the variance: $s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation: $s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

[Figure: the heights plotted with the interval Mean ± 1 sd marked]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
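As one example of "other software", the women's height calculation can be reproduced with Python's standard statistics module. This is a sketch, not the course's prescribed tool; statistics.variance and statistics.stdev both use the n - 1 divisor, matching the sample formulas above.

```python
import statistics

# Women's heights (inches) from the calculation table
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

mean = statistics.mean(heights)          # sample mean
variance = statistics.variance(heights)  # sample variance, divides by n - 1
std_dev = statistics.stdev(heights)      # square root of the sample variance

print(round(mean, 1), round(variance, 2), round(std_dev, 2))
```

The rounded results agree with the slide: mean 63.4, variance 6.55 square inches, standard deviation 2.56 inches. For a population standard deviation (divisor N), the same module provides statistics.pstdev.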

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand dev

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand dev

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Figure: the mean and standard deviation (numerical) and the histogram (graphical) combine in the 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

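68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]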

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
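The three interval counts for the textbook costs can be verified directly. The sketch below recomputes the mean and standard deviation from the 50 listed values and counts how many fall inside each interval; the helper name share_within is made up for this illustration.

```python
import statistics

costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

mean = statistics.mean(costs)   # about 375.48
s = statistics.stdev(costs)     # about 42.72

def share_within(k):
    """Percentage of costs inside (mean - k*s, mean + k*s)."""
    inside = [c for c in costs if mean - k * s < c < mean + k * s]
    return 100 * len(inside) / len(costs)

for k in (1, 2, 3):
    print(k, "sd:", share_within(k), "%")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, and 99.7% of the rule, which is only an approximation for roughly bell-shaped data.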

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
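The SAT/ACT comparison is a direct application of the z-score formula. A minimal sketch, with a hypothetical helper named z_score:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

eleanor = z_score(680, mean=500, sd=100)  # SAT Math
gerald = z_score(27, mean=18, sd=6)       # ACT Math

print("Eleanor:", eleanor)
print("Gerald:", gerald)
print("Eleanor's score is better" if eleanor > gerald else "Gerald's score is better")
```

Standardizing puts the two tests on a common scale, so scores measured in different units can be compared.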

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47
                       Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

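m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Sorted data (n = 25); columns: observation number, position counter, value

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1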

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: sorted data split at Q1, M, and Q3 into four pieces, each containing 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
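The quartile rules above can be written out as a short Python sketch. The function names median and quartiles are made up for this illustration, and the code follows this deck's convention (overall median included in both halves when n is odd); statistical packages often use other quartile conventions and may return slightly different values.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rules on the slide."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data splits cleanly into two halves
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten values in the example this reproduces Q1 = 6, median = 11, and Q3 = 16.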

Pulse Rates, n = 138

Count   Stem | Leaves
  3       4  | 588
  9       5  | 001233444
 10       5  | 5556788899
 23       6  | 00011111122233333344444
 23       6  | 55556666667777788888888
 16       7  | 00000112222334444
 23       7  | 55555666666777888888999
 10       8  | 0000112224
 10       8  | 5555667789
  4       9  | 0012
  2       9  | 58
  4      10  | 0223
  1      11  | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

        stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

(22 | 5 represents 225)

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

        stem | leaf
   2     22  | 55
   4     23  | 57
   6     24  | 26
   7     25  | 7
  10     26  | 257
  12     27  | 59
  (4)    28  | 1567
  15     29  | 35599
  10     30  | 333
   7     31  | 45
   5     32  | 155
   2     33  | 6
   1     34  | 0

(22 | 5 represents 225)

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data (n = 25), listed from observation 25 down to observation 1: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot of years until death for Disease X, vertical axis from 0 to 7]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (n = 25), listed from observation 25 down to observation 1: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot of years until death for Disease X, vertical axis from 0 to 8, with the value 7.9 plotted as an outlier]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
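The 1.5 × IQR outlier check for the Disease X data can be sketched in Python as follows. The quartiles are taken as stated on the slide rather than recomputed, and the variable names are illustrative only.

```python
# Years until death for the 25 individuals (largest value is 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2              # quartiles as given on the slide
iqr = q3 - q1                  # interquartile range, about 1.9
lower_fence = q1 - 1.5 * iqr   # about -0.55
upper_fence = q3 + 1.5 * iqr   # about 7.05

# Any value beyond a fence is flagged as an outlier
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# The upper whisker stops at the largest value that is not an outlier
whisker_top = max(y for y in years if y <= upper_fence)

print("outliers:", outliers)
print("upper whisker ends at:", whisker_top)
```

Only 7.9 lies beyond the upper fence, so the upper whisker ends at 6.1 and 7.9 is drawn as a separate point.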

ATM Withdrawals by Day Month Holidays

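Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with minimum 45, Q1 = 63, median 70, Q3 = 78, and outlier fences at 40.5 and 100.5]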

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival               Crew    First   Second   Third   Total
Alive   Count           212      202      118     178     710
        % of col       24.0     62.2     41.4    25.2    32.3
Dead    Count           673      123      167     528    1491
        % of col       76.0     37.8     58.6    74.8    67.7
Total   Count           885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                       Class
Survival               Crew    First   Second   Third   Total
Alive   Count           212      202      118     178     710
        % of col       24.0     62.2     41.4    25.2    32.3
Dead    Count           673      123      167     528    1491
        % of col       76.0     37.8     58.6    74.8    67.7
Total   Count           885      325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
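The marginal and conditional percentages for the Titanic table can be recomputed from the counts. The sketch below uses plain Python dictionaries; the variable names are illustrative, not from the slides.

```python
# Titanic counts by survival (rows) and class (columns)
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())          # 2201
class_totals = {c: sum(counts[o][c] for o in counts) for c in classes}   # column totals
alive_total = sum(counts["Alive"].values())                              # 710

# Marginal distribution of class: column total / grand total
marginal_class = {c: round(100 * class_totals[c] / grand_total, 1) for c in classes}

# Conditional distribution: % surviving within each class (the "% of col" row)
survival_given_class = {
    c: round(100 * counts["Alive"][c] / class_totals[c], 1) for c in classes
}

# The three questions differ only in the denominator
frac_survivors_in_first = counts["Alive"]["First"] / alive_total            # 202/710
frac_first_and_survived = counts["Alive"]["First"] / grand_total            # 202/2201
frac_first_who_survived = counts["Alive"]["First"] / class_totals["First"]  # 202/325

print(marginal_class)
print(survival_given_class)
```

The same count, 202, answers three different questions depending on whether it is divided by the number of survivors, the number of people aboard, or the number in first class.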

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

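Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

r = .9766

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$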
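The value r = .9766 can be reproduced from the ten (weight, fuel consumption) pairs by applying the formula term by term. This is a sketch with a hand-written correlation function so that each step mirrors the formula; it is not meant as the only way to compute r.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles)
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample sd's (n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is the product of the two z-scores for one car, so r is close to +1 when cars that are above average in weight also tend to be above average in fuel consumption.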

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3= third quartile = 42

Q1= first quartile = 23

25 1 7924 2 6123 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 79

Boxplot display of 5-number summary

BOXPLOT

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

8

Interquartile range

Q3 ndash Q1=42 minus 23 =

19

Q3+15IQR=42+285 = 705

15 IQR = 1519=285 Individual 25 has a value of

79 years so 79 is an outlier The line from the top

end of the box is drawn to the biggest number in the

data that is less than 705

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138) Q1 = 63 Q3 = 78 IQR=78 63=15

15(IQR)=15(15)=225

Q1 - 15(IQR) 63 ndash 225=405

Q3 + 15(IQR) 78 + 225=1005

7063 78405 100545

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who

gained at least 50 yards What is the approximate value of Q3

0 136273

410547

684821

9581095

12321369

Pass Catching Yards by Receivers

1 450

2 750

3 215

4 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3


Ways to measure variability

1. Range = largest - smallest. This is OK sometimes, but in general it is too crude: it is sensitive to one large or small observation.

2. Measure spread from the middle, where the middle is the mean $\bar{y}$.
   Deviation of $y_i$ from the mean: $y_i - \bar{y}$
   Sum the deviations of all the $y_i$'s from $\bar{y}$:
   $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ always, so this sum tells us nothing.

Example

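Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$
sum of deviations from mean: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$
sum of deviations from mean: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$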
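The two data sets can be checked numerically. The short Python sketch below is only an illustration; the helper name `sum_of_deviations` is ours and does not come from the slides. Both data sets have very different spread, yet the deviations from the mean add to zero in each case.

```python
def sum_of_deviations(values):
    """Return the sum of (value - mean) over all values."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)

data_set_1 = [49, 51]    # tightly clustered around the mean 50
data_set_2 = [0, 100]    # widely spread around the same mean 50

print(sum_of_deviations(data_set_1))  # 0.0
print(sum_of_deviations(data_set_2))  # 0.0
```

Because the positive and negative deviations always cancel, a useful measure of spread has to remove the signs first, for example by squaring.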

The Sample Standard Deviation: a measure of spread around the mean
Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$ is the sample standard deviation

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$ is called the sample variance

Calculations ...

Women's heights (inches)

 i    x_i   xbar   (x_i - xbar)   (x_i - xbar)^2
 1    59    63.4   -4.4           19.0
 2    60    63.4   -3.4           11.3
 3    61    63.4   -2.4            5.6
 4    62    63.4   -1.4            1.8
 5    62    63.4   -1.4            1.8
 6    63    63.4   -0.4            0.1
 7    63    63.4   -0.4            0.1
 8    63    63.4   -0.4            0.1
 9    64    63.4    0.6            0.4
10    64    63.4    0.6            0.4
11    65    63.4    1.6            2.7
12    66    63.4    2.6            7.0
13    67    63.4    3.6           13.3
14    68    63.4    4.6           21.6
      Mean 63.4    Sum 0.0        Sum 85.2

Mean = 63.4
Sum of squared deviations from mean = 85.2
(n - 1) = 13; (n - 1) is called the degrees of freedom (df)
$s^2$ = variance = 85.2/13 = 6.55 square inches
$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches


$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
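As one way of automating the calculation, the following Python sketch redoes the height example with the standard-library `statistics` module. The variable names are ours; the data are the 14 heights from the table.

```python
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

mean = statistics.mean(heights)
sum_sq_dev = sum((x - mean) ** 2 for x in heights)  # sum of squared deviations
variance = sum_sq_dev / (len(heights) - 1)          # divide by n - 1 = 13
std_dev = variance ** 0.5                           # square root of the variance

print(round(mean, 1), round(sum_sq_dev, 1), round(variance, 2), round(std_dev, 2))
# 63.4 85.2 6.55 2.56

# The library functions use the same n - 1 divisor.
print(round(statistics.variance(heights), 2), round(statistics.stdev(heights), 2))
# 6.55 2.56
```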

Population Standard Deviation

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$ is the population standard deviation

Denoted by the lower case Greek letter $\sigma$
$N$ is the population size (for example, $N$ = 34000 for NCSU)
$\mu$ is the population mean
The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.
3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.
   When does $s = 0$? When does $\sigma = 0$? When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).
5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?
The larger the value of $s$ (or $\sigma$), the greater the spread of the data.
The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
$\bar{y}$     sample mean
$m$           sample median
$s^2$         sample variance
$s$           sample stand. dev.

POPULATION
$\mu$         population mean
              population median
$\sigma^2$    population variance
$\sigma$      population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together
68-95-99.7 rule (also called the Empirical Rule)
z-scores

68-95-99.7 rule
Mean and Standard Deviation (numerical)
Histogram (graphical)

The 68-95-99.7 rule
If the histogram of the data is approximately bell-shaped, then
1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$
2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$
3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

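68-95-99.7 rule: 68% within 1 stan. dev. of the mean
[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean
[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]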

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

(the same 50 textbook costs as on the first slide of this example)
$\bar{y} = 375.48$, $s = 42.72$
1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%
68-95-99.7 rule: 68%

Example: textbook costs (cont)

(the same 50 textbook costs as on the first slide of this example)
$\bar{y} = 375.48$, $s = 42.72$
2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%
68-95-99.7 rule: 95%

Example: textbook costs (cont)

(the same 50 textbook costs as on the first slide of this example)
$\bar{y} = 375.48$, $s = 42.72$
3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%
68-95-99.7 rule: 99.7%
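The three interval counts can be reproduced in software. The Python sketch below is a minimal illustration; the helper name `share_within` is ours. It recomputes the mean and standard deviation from the 50 textbook costs and counts the values inside each interval.

```python
import statistics

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342, 346, 347, 348,
         348, 349, 354, 355, 355, 360, 361, 364, 367, 369, 371, 373, 377,
         380, 381, 382, 385, 385, 387, 390, 390, 397, 398, 409, 409, 410,
         418, 422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar = statistics.mean(costs)   # 375.48
s = statistics.stdev(costs)      # about 42.72 (n - 1 divisor)

def share_within(k):
    """Fraction of costs strictly inside (y_bar - k*s, y_bar + k*s)."""
    inside = [c for c in costs if y_bar - k * s < c < y_bar + k * s]
    return len(inside) / len(costs)

for k in (1, 2, 3):
    print(k, share_within(k))
# 1 0.64
# 2 0.96
# 3 1.0
```

The observed shares (64%, 96%, 100%) are close to the 68%, 95%, and 99.7% that the rule suggests for bell-shaped data.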

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together
68-95-99.7 rule (also called the Empirical Rule): preceding slides
z-scores: next

Z-scores: Standardized Data Values
Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91
Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100
ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8
Gerald's z-score: z = (27 - 18)/6 = 1.5
Eleanor's score is better.
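The exam and SAT/ACT comparisons both apply the same formula, which can be sketched in Python as follows. The function name `z_score` is ours, chosen for illustration.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam1 = z_score(91, 88, 6)         # 0.5
exam2 = z_score(92, 88, 10)        # 0.4
eleanor = z_score(680, 500, 100)   # 1.8 (SAT Math)
gerald = z_score(27, 18, 6)        # 1.5 (ACT Math)

print(exam1 > exam2)       # True: 91 on exam 1 is the better score
print(eleanor > gerald)    # True: Eleanor's score is better
```

Standardizing puts scores from different scales on a common footing, which is why a raw 27 and a raw 680 can be compared at all.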

Z-scores add to zero
Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland     15.5       6.4        1.79
UVA          13.1       4.0        1.12
Louisville   10.9       1.8        0.50
UNC           9.2       0.1        0.03
VaTech        7.9      -1.2       -0.34
FSU           7.9      -1.2       -0.34
GaTech        7.1      -2.0       -0.56
NCSU          6.5      -2.6       -0.73
Clemson       3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697
Sum of (y - ybar) = 0; sum of Z-scores = 0
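The claim that z-scores add to zero can be checked with a few lines of Python. This is only a sketch using the rounded support figures from the table; the variable names are ours.

```python
import statistics

# Support in $ millions, as listed in the table (rounded to one decimal)
support = {"Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
           "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8}

values = list(support.values())
mean = statistics.mean(values)    # 9.1
s = statistics.stdev(values)      # about 3.5697

z = {school: (y - mean) / s for school, y in support.items()}

print(round(mean, 4), round(s, 4))   # 9.1 3.5697
print(round(z["Maryland"], 2))       # 1.79
print(abs(sum(z.values())) < 1e-9)   # True: the z-scores add to zero
```

Recomputing from the rounded figures gives about -1.48 for Clemson rather than the tabulated -1.47; the small difference is presumably due to rounding in the slide.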

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)
Quartiles
5-Number Summary
Interquartile Range: Another Measure of Spread
Boxplots

m = median = 3.4
Q1 = first quartile = 2.3
Q3 = third quartile = 4.2

Rank   Count within half   Value
 1     1                   0.6
 2     2                   1.2
 3     3                   1.6
 4     4                   1.9
 5     5                   1.5
 6     6                   2.1
 7     7                   2.3
 8     6                   2.3
 9     5                   2.5
10     4                   2.8
11     3                   2.9
12     2                   3.3
13     1                   3.4
14     2                   3.6
15     3                   3.7
16     4                   3.8
17     5                   3.9
18     6                   4.1
19     7                   4.2
20     6                   4.5
21     5                   4.7
22     4                   4.9
23     3                   5.3
24     2                   5.6
25     1                   6.1

Quartiles: Measuring spread by examining the middle
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

        Q1       M       Q3
  1/4       1/4     1/4      1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half)
Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10
Median m = (10 + 12)/2 = 22/2 = 11
Q1: median of lower half 2 4 6 8 10; Q1 = 6
Q3: median of upper half 12 14 16 18 20; Q3 = 16
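The quartile rules above can be written as a short Python function. This is a sketch of the median-of-halves rule exactly as stated on the slide; the function name `quartiles` is ours. Calculators and software packages may use other quartile conventions and can return slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        lower = data[: half + 1]   # n odd: overall median goes in both halves
        upper = data[half:]
    else:
        lower = data[:half]        # n even: overall median is in neither half
        upper = data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```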

Pulse Rates, n = 138

Count   Stem   Leaves
         4
 3       4     588
 9       5     001233444
10       5     5556788899
23       6     00011111122233333344444
23       6     55556666667777788888888
16       7     00000112222334444
23       7     55555666666777888888999
10       8     0000112224
10       8     5555667789
 4       9     0012
 2       9     58
 4      10     0223
        10
 1      11     1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem   Leaf
  2     22     55
  4     23     57
  6     24     26
  7     25     7
 10     26     257
 12     27     59
 (4)    28     1567
 15     29     35599
 10     30     333
  7     31     45
  5     32     155
  2     33     6
  1     34     0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1
measures spread of middle 50% of the data

Example: beginning pulse rates
Q3 = 78, Q1 = 63
IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem   Leaf
  2     22     55
  4     23     57
  6     24     26
  7     25     7
 10     26     257
 12     27     59
 (4)    28     1567
 15     29     35599
 10     30     333
  7     31     45
  5     32     155
  2     33     6
  1     34     0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data
45, 63, 70, 78, 111

m = median = 3.4
Q3 = third quartile = 4.2
Q1 = first quartile = 2.3
Largest = max = 6.1
Smallest = min = 0.6

Data (n = 25, listed from rank 25 down to rank 1): 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Five-number summary: min, Q1, m, Q3, max

[Figure: boxplot displaying the 5-number summary for Disease X; vertical axis: Years until death, 0 to 7]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2
Q1 = first quartile = 2.3
Largest = max = 7.9

Data (n = 25, listed from rank 25 down to rank 1): 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot displaying the 5-number summary for Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9
1.5 × IQR = 1.5 × 1.9 = 2.85
Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
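The outlier check can be sketched in Python as follows. The variable names are ours, and the lower fence (Q1 - 1.5 × IQR) is included by analogy with the upper fence used on this slide. The quartiles use the median-of-halves rule for odd n.

```python
import statistics

# Years until death for the 25 individuals on this slide (largest value 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

data = sorted(years)
mid = len(data) // 2                        # n = 25 is odd
q1 = statistics.median(data[: mid + 1])     # lower half includes the median
q3 = statistics.median(data[mid:])          # upper half includes the median

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [y for y in data if y < lower_fence or y > upper_fence]
whisker_top = max(y for y in data if y <= upper_fence)

print(q1, q3, round(iqr, 2), round(upper_fence, 2))  # 2.3 4.2 1.9 7.05
print(outliers, whisker_top)                         # [7.9] 6.1
```

The upper whisker therefore stops at 6.1, the largest observation that is not flagged as an outlier.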

ATM Withdrawals by Day Month Holidays

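Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5
Q1 - 1.5(IQR): 63 - 22.5 = 40.5
Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with the values 40.5, 45, 63, 70, 78, and 100.5 marked]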

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics
Contingency Tables for Bivariate Categorical Data
Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology
Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.
Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive    212    202     118      178      710
Dead     673    123     167      528     1491
Total    885    325     285      706     2201

Marginal distributions

marg. dist. of survival:
Alive   710/2201 = 32.3%
Dead   1491/2201 = 67.7%

marg. dist. of class:
Crew    885/2201 = 40.2%
First   325/2201 = 14.8%
Second  285/2201 = 12.9%
Third   706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions
Given the class of a passenger, what is the chance the passenger survived?

                      Class
Survival              Crew    First   Second   Third   Total
Alive   Count         212     202     118      178      710
        % of col      24.0%   62.2%   41.4%    25.2%    32.3%
Dead    Count         673     123     167      528     1491
        % of col      76.0%   37.8%   58.6%    74.8%    67.7%
Total   Count         885     325     285      706     2201

Conditional distributions: segmented bar chart
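The marginal and conditional percentages in the Titanic tables can be recomputed from the counts. The Python sketch below is one possible way to do it; the dictionary layout and variable names are ours.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]
total = sum(sum(row.values()) for row in counts.values())   # 2201

# Marginal distribution of survival: row totals divided by the grand total
marginal_survival = {status: round(100 * sum(row.values()) / total, 1)
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total
class_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
marginal_class = {c: round(100 * class_totals[c] / total, 1) for c in classes}

# Conditional distribution of survival given class: percent of each column
alive_given_class = {c: round(100 * counts["Alive"][c] / class_totals[c], 1)
                     for c in classes}

print(marginal_survival)   # {'Alive': 32.3, 'Dead': 67.7}
print(marginal_class)      # {'Crew': 40.2, 'First': 14.8, 'Second': 12.9, 'Third': 32.1}
print(alive_given_class)   # {'Crew': 24.0, 'First': 62.2, 'Second': 41.4, 'Third': 25.2}
```

The only difference between the marginal and conditional calculations is the denominator: the grand total for a marginal distribution, the column total for a distribution conditional on class.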

Contingency Tables for Bivariate Categorical Data - 3

                      Class
Survival              Crew    First   Second   Third   Total
Alive   Count         212     202     118      178      710
        % of col      24.0%   62.2%   41.4%    25.2%    32.3%
Dead    Count         673     123     167      528     1491
        % of col      76.0%   37.8%   58.6%    74.8%    67.7%
Total   Count         885     325     285      706     2201

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics
Contingency Tables for Bivariate Categorical Data: previous slides
Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:
1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight
x = car weight, y = fuel cons.
$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

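Correlation: Fuel Consumption vs. Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$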
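As a rough check on the value of r for the fuel consumption data, the formula can be evaluated with a few lines of Python. This is only a sketch: the function name `correlation` and the variable names are illustrative choices, and the data are the ten (weight, fuel consumption) pairs listed on the scatterplot slide.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation: average product of standardized x and y values,
    using n - 1 in the denominator as in the slide formula."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

Swapping the roles of x and y gives the same r, since the formula treats the two variables symmetrically.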

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs. Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


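Example: sum of deviations from mean

Data set 1: $x_1 = 49$, $x_2 = 51$, $\bar{x} = 50$

$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$

Data set 2: $y_1 = 0$, $y_2 = 100$, $\bar{y} = 50$

$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$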

The Sample Standard Deviation: a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the "average" of these squared deviations.

Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$ is called the sample variance.

Calculations ...

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

$s^2$ = variance = 85.2/13 = 6.55 square inches

$s$ = standard deviation = $\sqrt{6.55}$ = 2.56 inches

Women's height (inches)

  i    x_i   mean   (x_i - mean)   (x_i - mean)^2
  1    59    63.4      -4.4            19.0
  2    60    63.4      -3.4            11.3
  3    61    63.4      -2.4             5.6
  4    62    63.4      -1.4             1.8
  5    62    63.4      -1.4             1.8
  6    63    63.4      -0.4             0.1
  7    63    63.4      -0.4             0.1
  8    63    63.4      -0.4             0.1
  9    64    63.4       0.6             0.4
 10    64    63.4       0.6             0.4
 11    65    63.4       1.6             2.7
 12    66    63.4       2.6             7.0
 13    67    63.4       3.6            13.3
 14    68    63.4       4.6            21.6
       Mean 63.4     Sum 0.0        Sum 85.2

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

1. First calculate the variance $s^2$.
2. Then take the square root to get the standard deviation $s$.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
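One possible way to automate the calculation is with Python's standard `statistics` module. The sketch below uses the 14 women's heights from the table and follows the two steps (variance first, then square root); the variable names are illustrative, not part of the course materials.

```python
import statistics

# Women's heights (inches) from the worked example
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = statistics.mean(heights)
sum_sq_dev = sum((x - mean) ** 2 for x in heights)  # sum of squared deviations
variance = sum_sq_dev / (n - 1)                     # step 1: sample variance (df = n - 1)
std_dev = variance ** 0.5                           # step 2: square root

print(round(mean, 1), round(sum_sq_dev, 1), round(variance, 2), round(std_dev, 2))
# 63.4 85.2 6.55 2.56

# The library function gives the same sample standard deviation directly.
assert abs(std_dev - statistics.stdev(heights)) < 1e-9
```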

Population Standard Deviation

Denoted by the lower case Greek letter $\sigma$

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$$

$N$ is the population size (for example, $N$ = 34,000 for NCSU)

$\mu$ is the population mean

The value of $\sigma$ is typically not known; use $s$ to estimate the value of $\sigma$.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: $s^2$ is the sample variance; $\sigma^2$ is the population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0. When does $s = 0$? When does $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

$\bar{y}$    sample mean
$m$          sample median
$s^2$        sample variance
$s$          sample stand. dev.

POPULATION

$\mu$        population mean
$m$          population median
$\sigma^2$   population variance
$\sigma$     population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical); Histogram (graphical); 68-95-99.7 rule]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

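68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]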

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
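The interval counts for the textbook cost example can be reproduced with a short script. This is a sketch under the assumption that the 50 costs are exactly the values listed on the slide; `share_within` is a hypothetical helper name introduced here for illustration.

```python
import statistics

costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

mean = statistics.mean(costs)   # 375.48
s = statistics.stdev(costs)     # about 42.72

def share_within(k):
    """Count and percentage of costs inside (mean - k*s, mean + k*s)."""
    low, high = mean - k * s, mean + k * s
    count = sum(low < c < high for c in costs)
    return count, 100 * count / len(costs)

for k in (1, 2, 3):
    count, pct = share_within(k)
    print(f"within {k} sd: {count}/50 = {pct:.0f}%")
# within 1 sd: 32/50 = 64%
# within 2 sd: 48/50 = 96%
# within 3 sd: 50/50 = 100%
```

The observed 64%, 96%, and 100% are close to the 68%, 95%, and 99.7% that the rule predicts for bell-shaped data.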

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
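The z-score comparisons above take only a few lines of code. The sketch below defines an illustrative helper, `z_score` (a name chosen here, not taken from the slides), and applies it to the exam and SAT/ACT examples.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam example: same mean, different spreads
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

# SAT vs ACT example: different scales made comparable by standardizing
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

print(exam1, exam2)      # 0.5 0.4
print(eleanor, gerald)   # 1.8 1.5
```

In both comparisons the larger z-score identifies the better relative performance, even when the raw score is lower (91 vs. 92) or measured on a different scale (SAT vs. ACT).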

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (deviations), Sum = 0 (z-scores)
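The claim that z-scores add to zero can be checked numerically for the ACC support figures. This is a minimal sketch using the nine support values from the table; because of floating-point rounding the sums come out as tiny numbers near zero rather than exactly zero.

```python
import statistics

# Student/institutional support ($ millions) for the 9 public ACC schools
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

mean = statistics.mean(support)
s = statistics.stdev(support)

deviations = [y - mean for y in support]
z_scores = [d / s for d in deviations]

print(round(mean, 4), round(s, 4))   # 9.1 3.5697
print(abs(sum(deviations)) < 1e-9)   # True: deviations from the mean sum to 0
print(abs(sum(z_scores)) < 1e-9)     # True: so do the z-scores
```

Dividing every deviation by the same constant s cannot change the fact that the deviations sum to zero, which is why the z-scores inherit the property.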

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (n = 25, listed by rank 1 to 25): 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

  1/4  |  Q1  |  1/4  |  M  |  1/4  |  Q3  |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
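The quartile rules above translate directly into code. The sketch below is one possible implementation; the function name `quartiles` is an illustrative choice, and it deliberately follows the slide's median-of-halves rule rather than the interpolation methods that many software packages use by default, so results from other tools can differ slightly.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule:
    odd n includes the overall median in both halves, even n in neither."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        lower, upper = xs[:half + 1], xs[half:]  # middle value goes in both halves
    else:
        lower, upper = xs[:half], xs[half:]      # clean split, median is not a data value
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

# Even n: the slide example
print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)

# Odd n: the 25 values from the quartile slide (Q1 = 2.3, m = 3.4, Q3 = 4.2)
values = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
          3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
print(quartiles(values))  # (2.3, 3.4, 4.2)
```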

Pulse Rates n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

Data (n = 25, listed by rank 1 to 25): 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

[Figure: boxplot of the Disease X data; vertical axis: Years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Data (n = 25, listed by rank 1 to 25): 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 6.1 7.9

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data with one outlier; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
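The outlier check for the boxplot can be sketched in code as follows. The quartiles are taken as given on the slide (Q1 = 2.3, Q3 = 4.2) rather than recomputed, and the variable names (`upper_fence`, `whisker_top`, and so on) are illustrative labels, not terminology from the slides.

```python
# Disease X data (years until death), including the value 7.9
data = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2                # quartiles as stated on the slide
iqr = q3 - q1                    # about 1.9
upper_fence = q3 + 1.5 * iqr     # about 7.05
lower_fence = q1 - 1.5 * iqr     # about -0.55

# Values beyond the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Each whisker ends at the most extreme data value still inside its fence.
whisker_top = max(x for x in data if x <= upper_fence)
whisker_bottom = min(x for x in data if x >= lower_fence)

print(outliers)                     # [7.9]
print(whisker_bottom, whisker_top)  # 0.6 6.1
```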

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with marks at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition: 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
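The marginal distributions can be computed from the table by summing across rows or down columns and dividing by the grand total. The sketch below stores the Titanic counts in a nested dictionary; this layout and the variable names are illustrative assumptions, not part of the original example.

```python
# Titanic counts: survival status (rows) by class (columns)
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())  # 2201

# Marginal distribution of survival: row totals / grand total
survival_marginal = {
    status: round(100 * sum(row.values()) / grand_total, 1)
    for status, row in counts.items()
}

# Marginal distribution of class: column totals / grand total
classes = list(counts["Alive"])
class_marginal = {
    c: round(100 * sum(counts[status][c] for status in counts) / grand_total, 1)
    for c in classes
}

print(survival_marginal)  # {'Alive': 32.3, 'Dead': 67.7}
print(class_marginal)     # {'Crew': 40.2, 'First': 14.8, 'Second': 12.9, 'Third': 32.1}
```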

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
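The three questions use the same count (202 first-class survivors) with three different denominators, which is easy to see when written out in code. This is a sketch with illustrative variable names; the counts are those in the Titanic table.

```python
# Titanic counts by class for each survival status
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())               # 710 survivors
total_all = total_alive + sum(dead.values())    # 2201 people on board
first_class = alive["First"] + dead["First"]    # 325 first-class passengers

# Same numerator, three different "wholes"
frac_survivors_in_first = alive["First"] / total_alive   # 202/710: among survivors
frac_first_and_survived = alive["First"] / total_all     # 202/2201: among everyone
frac_first_who_survived = alive["First"] / first_class   # 202/325: among first class

# Conditional distribution of survival given class ("% of col" in the table)
survival_rate_by_class = {
    c: round(100 * alive[c] / (alive[c] + dead[c]), 1) for c in alive
}
print(survival_rate_by_class)
# {'Crew': 24.0, 'First': 62.2, 'Second': 41.4, 'Third': 25.2}
```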

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

Page 74: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

The Sample Standard Deviation a measure of spread around the mean Square the deviation of each

observation from the mean find the square root of the ldquoaveragerdquo of these squared deviations

2

1

2

2 1

( )sample standard deviation

1

( )is called the sample variance

1

n

ii

n

ii

y ys

n

y ys

n

Calculations …

Women height (inches)

 i    x_i    x̄      (x_i - x̄)   (x_i - x̄)²
 1    59    63.4      -4.4        19.0
 2    60    63.4      -3.4        11.3
 3    61    63.4      -2.4         5.6
 4    62    63.4      -1.4         1.8
 5    62    63.4      -1.4         1.8
 6    63    63.4      -0.4         0.1
 7    63    63.4      -0.4         0.1
 8    63    63.4      -0.4         0.1
 9    64    63.4       0.6         0.4
10    64    63.4       0.6         0.4
11    65    63.4       1.6         2.7
12    66    63.4       2.6         7.0
13    67    63.4       3.6        13.3
14    68    63.4       4.6        21.6
      Mean 63.4     Sum 0.0     Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n - 1) = 13; (n - 1) is called the degrees of freedom (df)

s² = variance = 85.2/13 = 6.55 square inches

s = standard deviation = √6.55 = 2.56 inches

1. First calculate the variance s²:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

2. Then take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Mean ± 1 s.d.

We'll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
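As one possible way to let software do the work, the short Python sketch below repeats the height calculation step by step. The variable names are illustrative, and the choice of Python with its standard `statistics` module is an assumption; the slides themselves only mention a calculator, Excel, or other software.

```python
import math
import statistics

# Heights (inches) of the 14 women from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in heights)

# Sample variance divides by the degrees of freedom, n - 1.
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_sq_dev:.1f}")
print(f"s^2 = {variance:.2f} square inches")
print(f"s = {std_dev:.2f} inches")

# Cross-check against the library routine, which also divides by n - 1.
library_std_dev = statistics.stdev(heights)
```

The hand-style calculation and `statistics.stdev` should agree up to floating-point rounding.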

Population Standard Deviation

Denoted by the lowercase Greek letter σ

N is the population size (for example, N = 34,000 for NCSU)

μ is the population mean

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}} \quad \text{population standard deviation}$$

The value of σ is typically not known; use s to estimate the value of σ.
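To make the difference between the two divisors concrete, here is a small sketch that treats the 14 heights from the earlier example first as a sample (divide by n - 1) and then, purely hypothetically, as if they were an entire population (divide by N). Treating those heights as a population is an assumption made only for illustration.

```python
import statistics

# The 14 heights (inches) from the sample standard deviation example.
values = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

# Sample standard deviation s: divides the sum of squared deviations by n - 1.
s = statistics.stdev(values)

# Population standard deviation sigma: divides by N
# (hypothetical here, pretending the 14 values are the whole population).
sigma = statistics.pstdev(values)

print(f"s (divide by n - 1) = {s:.3f}")
print(f"sigma (divide by N) = {sigma:.3f}")
```

For the same numbers, s is always at least as large as the value obtained by dividing by N, and the gap shrinks as the number of observations grows.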

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance, σ² is the population variance. Units are the squared units of the original data (square $, square gallons).

Review: Properties of s and σ

s and σ are always greater than or equal to 0.

When does s = 0? When does σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE
ȳ     sample mean
m     sample median
s²    sample variance
s     sample stand dev

POPULATION
μ     population mean
      population median
σ²    population variance
σ     population stand dev

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$
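The three percentages in the rule are the areas under an exactly normal (bell-shaped) curve within 1, 2, and 3 standard deviations of its mean. The sketch below recovers them with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); real data that are only approximately bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * coverage(k):.1f}%")
```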

68-95-99.7 rule: 68% within 1 stan dev of the mean

[Figure: bell-shaped curve with 68% of the area between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan dev of the mean

[Figure: bell-shaped curve with 95% of the area between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]

Example: textbook costs

ȳ = 375.48, s = 42.72, n = 50

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

ȳ = 375.48, s = 42.72

1 standard deviation interval about the mean: (ȳ - s, ȳ + s) = (332.76, 418.20)

Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

ȳ = 375.48, s = 42.72

2 standard deviation interval about the mean: (ȳ - 2s, ȳ + 2s) = (290.04, 460.92)

Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

ȳ = 375.48, s = 42.72

3 standard deviation interval about the mean: (ȳ - 3s, ȳ + 3s) = (247.32, 503.64)

Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
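The three counts in the textbook example can be reproduced with a few lines of code. The sketch below uses the 50 costs and the mean and standard deviation quoted on the slide (ȳ = 375.48, s = 42.72); the helper name `share_within` is made up for this illustration.

```python
# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

# Summary values as quoted on the slide.
y_bar = 375.48
s = 42.72


def share_within(k):
    """Count and percentage of costs inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low < cost < high for cost in costs)
    return inside, 100 * inside / len(costs)


for k in (1, 2, 3):
    count, percent = share_within(k)
    print(f"within {k} s.d.: {count} of {len(costs)} values = {percent:.0f}%")
```

The observed 64%, 96%, and 100% are close to, but not exactly, the 68%, 95%, and 99.7% that the rule gives for a perfectly bell-shaped histogram.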

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
y = original data value
ȳ = the sample mean
s = the sample standard deviation
z = the z-score corresponding to y

Exam 1: ȳ₁ = 88, s₁ = 6; your exam 1 score: 91

Exam 2: ȳ₂ = 88, s₂ = 10; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean ȳ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
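The two comparisons above (exam 1 vs exam 2, SAT vs ACT) use the same one-line calculation. A minimal sketch follows; the function name `z_score` is an illustrative choice, not something defined in the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: both exams have mean 88, but different standard deviations.
z_exam1 = z_score(91, mean=88, sd=6)
z_exam2 = z_score(92, mean=88, sd=10)

# SAT vs ACT example.
z_eleanor = z_score(680, mean=500, sd=100)
z_gerald = z_score(27, mean=18, sd=6)

print(f"exam 1: z = {z_exam1:.1f}, exam 2: z = {z_exam2:.1f}")
print(f"Eleanor (SAT): z = {z_eleanor:.1f}, Gerald (ACT): z = {z_gerald:.1f}")
```

Because z-scores are unit-free, scores measured on different scales can be compared directly: the larger z-score is the better relative standing.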

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4        1.79
UVA           13.1       4.0        1.12
Louisville    10.9       1.8        0.50
UNC            9.2       0.1        0.03
VaTech         7.9      -1.2       -0.34
FSU            7.9      -1.2       -0.34
GaTech         7.1      -2.0       -0.56
NCSU           6.5      -2.6       -0.73
Clemson        3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum = 0 (y - ybar column), Sum = 0 (Z-score column)
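The "add to zero" property holds for any data set, because the deviations from the mean always sum to zero. The sketch below checks it on the support figures as printed in the table. Since those figures are rounded to one decimal place, an individual z-score computed here can differ from the table in its last digit, but the sum is still zero up to floating-point error.

```python
import statistics

# Support in $ millions, as printed in the table (rounded to one decimal).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
mean = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# z-score for each school.
z_scores = {school: (y - mean) / s for school, y in support.items()}

deviation_sum = sum(y - mean for y in values)
z_sum = sum(z_scores.values())

print(f"mean = {mean:.4f}, s = {s:.4f}")
print(f"sum of deviations = {deviation_sum:.6f}, sum of z-scores = {z_sum:.6f}")
```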

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1      M      Q3
 1/4     1/4     1/4     1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
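The quartile rules above can be sketched in a few lines of Python. The function name `quartiles` is illustrative. Note that this follows the convention stated on the slide (include the overall median in both halves when n is odd), which is only one of several quartile conventions in use, so calculators and software packages may return slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower = values[: half + 1]
        upper = values[half:]
    else:
        # n even: the two halves split cleanly, with no shared value.
        lower = values[:half]
        upper = values[half:]
    return median(lower), median(values), median(upper)


# Example from the slide (n = 10, even).
even_example = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# The 25 values (n odd) from the quartile illustration earlier in this section.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
odd_example = quartiles(years)

print(even_example)
print(odd_example)
```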

Pulse Rates n = 138

Count   Stem   Leaves
          4
  3       4    588
  9       5    001233444
 10       5    5556788899
 23       6    00011111122233333344444
 23       6    55556666667777788888888
 16       7    0000011222233444
 23       7    55555666666777888888999
 10       8    0000112224
 10       8    5555667789
  4       9    0012
  2       9    58
  4      10    0223
         10
  1      11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1; middle quartile: median; upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: Disease X; vertical axis "Years until death", 0 to 7; five-number summary marked: min, Q1, m, Q3, max]

Boxplot: display of 5-number summary

[Figure: BOXPLOT]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: BOXPLOT of Disease X; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labels 40.5, 45, 63, 70, 78, 100.5]
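The 1.5 × IQR calculation used in the two boxplot examples can be written as a small helper. The sketch below reproduces the pulse-rate limits (40.5 and 100.5) and the Disease X outlier check; the function name `outlier_fences` is an illustrative choice, and the Disease X values are the 25 listed on the earlier slide with 7.9 as the largest.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Pulse rates: Q1 = 63, Q3 = 78.
pulse_lower, pulse_upper = outlier_fences(63, 78)

# Disease X (years until death), version with largest value 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]
lower, upper = outlier_fences(2.3, 4.2)

# Values beyond either limit are flagged as outliers.
outliers = [y for y in years if y < lower or y > upper]

# The upper whisker ends at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper)

print(f"pulse limits: {pulse_lower}, {pulse_upper}")
print(f"Disease X upper limit: {upper:.2f}, outliers: {outliers}, whisker top: {whisker_top}")
```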

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival
Alive    710/2201 = 32.3%
Dead    1491/2201 = 67.7%

marg. dist. of class
Crew     885/2201 = 40.2%
First    325/2201 = 14.8%
Second   285/2201 = 12.9%
Third    706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                             Class
Survival               Crew    First   Second   Third   Total
Alive    Count          212      202      118     178     710
         % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead     Count          673      123      167     528    1491
         % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total    Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
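The marginal and conditional percentages for the Titanic table all come from dividing a count by the appropriate total. A minimal sketch, using plain dictionaries and the counts from the contingency table (the variable names are illustrative):

```python
# Counts from the Titanic contingency table: survival status by class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

# Row totals, column totals, and the grand total.
row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: each total divided by the grand total.
marginal_survival = {s: row_totals[s] / grand_total for s in row_totals}
marginal_class = {c: col_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions about first class.
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
first_and_survived = counts["Alive"]["First"] / grand_total             # 202/2201
survived_given_first = counts["Alive"]["First"] / col_totals["First"]   # 202/325

for c in classes:
    print(f"{c}: {100 * marginal_class[c]:.1f}% of all on board, "
          f"{100 * survival_given_class[c]:.1f}% survived")
```

The three first-class fractions share the same numerator (202) and differ only in the denominator, which is what distinguishes a conditional percentage from a joint or marginal one.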

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

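Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$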
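The value r = .9766 can be checked directly from the ten (weight, fuel consumption) pairs using the formula for r: standardize each x and each y, multiply the pairs of z-scores, and divide the sum of the products by n - 1. The sketch below is one way to write that out; the function name `correlation` is an illustrative choice.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Correlation r: sum of products of paired z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Swapping the roles of x and y leaves r unchanged, since the formula treats the two variables symmetrically.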

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 75: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Calculations hellip

Mean = 634

Sum of squared deviations from mean = 852

(n minus 1) = 13 (n minus 1) is called degrees freedom (df)

s2 = variance = 85213 = 655 square inches

s = standard deviation = radic655 = 256 inches

Women height (inches)i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

i xi x (xi-x) (xi-x)2

1 59 634 -44 190

2 60 634 -34 113

3 61 634 -24 56

4 62 634 -14 18

5 62 634 -14 18

6 63 634 -04 01

7 63 634 -04 01

8 63 634 -04 01

9 64 634 06 04

10 64 634 06 04

11 65 634 16 27

12 66 634 26 70

13 67 634 36 133

14 68 634 46 216

Mean 634

Sum 00

Sum 852

x

2

1

2 )(1

1xx

ns

n

i

1 First calculate the variance s22 Then take the square root to get the

standard deviation s

2

1

)(1

1xx

ns

n

i

Meanplusmn 1 sd

Wersquoll never calculate these by hand so make sure to know how to get the standard deviation using your calculator Excel or other software

Population Standard Deviation

2

1

Denoted by the lower case Greek letter

is the size (for example =34000 for NCSU)

is the mean

( )population standard deviation

va

po

lue of typically not known

us

pulation

populatio

e

n

N

ii

N N

y

N

s

to estimate value of

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team What is the

value of the first quartile Q1

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 287

2 2575

3 2635

4 2625

Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

 i   position   years
25      1        6.1
24      2        5.6
23      3        5.3
22      4        4.9
21      5        4.7
20      6        4.5
19      7        4.2
18      6        4.1
17      5        3.9
16      4        3.8
15      3        3.7
14      2        3.6
13      1        3.4
12      2        3.3
11      3        2.9
10      4        2.8
 9      5        2.5
 8      6        2.3
 7      7        2.3
 6      6        2.1
 5      5        1.5
 4      4        1.9
 3      3        1.6
 2      2        1.2
 1      1        0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot of the Disease X data (vertical axis: years until death, 0 to 7), a display of the five-number summary: min, Q1, m, Q3, max]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

 i   position   years
25      1        7.9
24      2        6.1
23      3        5.3
22      4        4.9
21      5        4.7
20      6        4.5
19      7        4.2
18      6        4.1
17      5        3.9
16      4        3.8
15      3        3.7
14      2        3.6
13      1        3.4
12      2        3.3
11      3        2.9
10      4        2.8
 9      5        2.5
 8      6        2.3
 7      7        2.3
 6      6        2.1
 5      5        1.5
 4      4        1.9
 3      3        1.6
 2      2        1.2
 1      1        0.6

Largest = max = 7.9

Boxplot display of 5-number summary

[Figure: boxplot of the Disease X data (vertical axis: years until death, 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is greater than 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
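The outlier check above can be sketched in Python. This is a minimal illustration using the Disease X values as read from the slide (decimal points restored) and the quartiles stated there; the names `iqr_fences`, `whisker_top`, and `whisker_bottom` are hypothetical helpers, not part of any library.

```python
def iqr_fences(q1, q3):
    """Lower and upper fences, 1.5 * IQR beyond the quartiles."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


# Disease X data (years until death), including the 7.9-year individual
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

# Quartiles as stated on the slide
q1, q3 = 2.3, 4.2
lower_fence, upper_fence = iqr_fences(q1, q3)

# Values beyond a fence are flagged as outliers
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# Each whisker ends at the most extreme value that is not an outlier
whisker_top = max(y for y in years if y <= upper_fence)
whisker_bottom = min(y for y in years if y >= lower_fence)

print(round(upper_fence, 2), outliers, whisker_top, whisker_bottom)
```

Taking Q1 and Q3 as inputs keeps the sketch independent of any particular quartile convention.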

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, labeled at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

marg. dist. of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
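The marginal distributions above can be reproduced with a short Python sketch. The nested dictionary `counts` is simply one convenient way to hold the Titanic table; the variable names are illustrative.

```python
# Titanic counts from the contingency table: rows = survival, columns = class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total
survival_pct = {
    status: round(100 * sum(row.values()) / grand_total, 1)
    for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total
class_pct = {
    c: round(100 * sum(counts[s][c] for s in counts) / grand_total, 1)
    for c in classes
}

print(grand_total)
print(survival_pct)
print(class_pct)
```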

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival            Crew  First  Second  Third  Total
Alive   Count        212    202     118    178    710
        % of col    24.0   62.2    41.4   25.2   32.3
Dead    Count        673    123     167    528   1491
        % of col    76.0   37.8    58.6   74.8   67.7
Total   Count        885    325     285    706   2201
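The "% of col" rows can be computed as follows. This is a minimal Python sketch; `survival_given_class` is a hypothetical helper name, and the counts are those in the table above.

```python
# Titanic counts: rows = survival, columns = class
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def survival_given_class(table, passenger_class):
    """Conditional distribution of survival within one class (column percents)."""
    column_total = sum(row[passenger_class] for row in table.values())
    return {
        status: round(100 * row[passenger_class] / column_total, 1)
        for status, row in table.items()
    }


for c in ["Crew", "First", "Second", "Third"]:
    print(c, survival_given_class(counts, c))
```

Each column is divided by its own total, so the percentages within a class add to 100 (up to rounding).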

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                          Class
Survival            Crew  First  Second  Third  Total
Alive   Count        212    202     118    178    710
        % of col    24.0   62.2    41.4   25.2   32.3
Dead    Count        673    123     167    528   1491
        % of col    76.0   37.8    58.6   74.8   67.7
Total   Count        885    325     285    706   2201
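The three fractions use the same count, 202, but different denominators, because each question conditions on a different group. Written out with decimal values (the probability notation is an assumption added here, not taken from the slide):

```latex
\begin{aligned}
P(\text{first class} \mid \text{survived}) &= \frac{202}{710} \approx 0.285 \\
P(\text{first class and survived}) &= \frac{202}{2201} \approx 0.092 \\
P(\text{survived} \mid \text{first class}) &= \frac{202}{325} \approx 0.622
\end{aligned}
```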

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
   1       5      0.1
   2       2      0.03
   3       9      0.19
   4       7      0.095
   5       3      0.07
   6       3      0.02
   7       4      0.07
   8       5      0.085
   9       8      0.12
  10       3      0.04
  11       5      0.06
  12       5      0.05
  13       6      0.1
  14       7      0.09
  15       1      0.01
  16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

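Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$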
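The formula for r can be applied directly to the ten (weight, fuel consumption) pairs. The sketch below is a plain-Python illustration, assuming the data values are read with their decimal points restored; `sample_sd` and `correlation` are hypothetical helper names. The result should come out close to the r = 0.9766 reported on the slide.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```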

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


 i    xi     x̄     (xi - x̄)   (xi - x̄)²
 1    59    63.4     -4.4       19.0
 2    60    63.4     -3.4       11.3
 3    61    63.4     -2.4        5.6
 4    62    63.4     -1.4        1.8
 5    62    63.4     -1.4        1.8
 6    63    63.4     -0.4        0.1
 7    63    63.4     -0.4        0.1
 8    63    63.4     -0.4        0.1
 9    64    63.4      0.6        0.4
10    64    63.4      0.6        0.4
11    65    63.4      1.6        2.7
12    66    63.4      2.6        7.0
13    67    63.4      3.6       13.3
14    68    63.4      4.6       21.6

Mean of xi: 63.4

Sum of (xi - x̄): 0.0

Sum of (xi - x̄)²: 85.2

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

1. First calculate the variance s².

2. Then take the square root to get the standard deviation s.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

[Figure annotation: Mean ± 1 s.d.]

We'll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
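As one way to automate the calculation, here is a minimal Python sketch that follows the two steps (variance first, then square root) on the 14 values from the table and cross-checks the result against the standard library's `statistics.stdev`, which also divides by n - 1. The variable names are illustrative.

```python
import statistics
from math import sqrt

x = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]
n = len(x)

mean = sum(x) / n
sum_sq_dev = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations

variance = sum_sq_dev / (n - 1)  # step 1: sample variance
s = sqrt(variance)               # step 2: sample standard deviation

print(round(mean, 1), round(sum_sq_dev, 1), round(s, 2))
print(round(statistics.stdev(x), 2))  # library cross-check
```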

Population Standard Deviation

Denoted by the lower case Greek letter σ

N is the population size (for example, N = 34000 for NCSU)

μ is the population mean

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i-\mu)^2}$$ (population standard deviation)

The value of σ is typically not known; use s to estimate the value of σ.

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance; σ² is the population variance. Units are squared units of the original data: square $, square gallons.

Review: Properties of s and σ

s and σ are always greater than or equal to 0.

When does s = 0? σ = 0?

The larger the value of s (or σ), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

ȳ: sample mean

m: sample median

s²: sample variance

s: sample stand. dev.

POPULATION

μ: population mean

population median

σ²: population variance

σ: population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule links the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in (ȳ - s, ȳ + s)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in (ȳ - 2s, ȳ + 2s)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in (ȳ - 3s, ȳ + 3s)

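68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve (vertical axis 0 to 0.45); 68% of the area lies between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve (vertical axis 0 to 0.45); 95% of the area lies between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]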

Example: textbook costs

ȳ = 375.48

s = 42.72

n = 50

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

ȳ = 375.48, s = 42.72

1 standard deviation interval about the mean: (ȳ - s, ȳ + s) = (332.76, 418.20)

percentage of data values in this interval: 32/50 = 64%

68-95-99.7 rule: 68%

Example: textbook costs (cont)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

ȳ = 375.48, s = 42.72

2 standard deviation interval about the mean: (ȳ - 2s, ȳ + 2s) = (290.04, 460.92)

percentage of data values in this interval: 48/50 = 96%

68-95-99.7 rule: 95%

Example: textbook costs (cont)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

ȳ = 375.48, s = 42.72

3 standard deviation interval about the mean: (ȳ - 3s, ȳ + 3s) = (247.32, 503.64)

percentage of data values in this interval: 50/50 = 100%

68-95-99.7 rule: 99.7%
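The three interval counts for the textbook costs can be checked with a short Python sketch. It uses the mean and standard deviation as stated on the slides rather than recomputing them; `share_within` is a hypothetical helper name.

```python
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

y_bar, s = 375.48, 42.72  # values stated on the slide


def share_within(data, center, spread, k):
    """Count and percentage of values within k standard deviations of the mean."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for value in data if low < value < high)
    return inside, 100 * inside / len(data)


for k in (1, 2, 3):
    print(k, share_within(costs, y_bar, s, k))
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68%, 95%, and 99.7% that the rule gives for bell-shaped data.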

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y-\bar{y}}{s}$$

where

y = original data value

ȳ = the sample mean

s = the sample standard deviation

z = the z-score corresponding to y

Exam 1: ȳ1 = 88, s1 = 6; your exam 1 score: 91

Exam 2: ȳ2 = 88, s2 = 10; your exam 2 score: 92

Which score is better?

z1 = (91 - 88)/6 = 3/6 = 0.5

z2 = (92 - 88)/10 = 4/10 = 0.4

91 on exam 1 is better than 92 on exam 2.

If data has mean ȳ and standard deviation s, then standardizing a particular value of y indicates how many standard deviations y is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
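Both comparisons above are the same calculation with different inputs, which a small Python sketch makes explicit. The function name `z_score` is illustrative.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: same mean, different spreads
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales made comparable
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(exam1, exam2)
print(eleanor, gerald)
```

The larger z-score marks the better relative standing, even when the raw score is lower (91 vs 92) or on a different scale (680 vs 27).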

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School      Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; Sum of Z-scores = 0
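The two zero sums in the table are not a coincidence. A short derivation, using the z-score definition $z_i = (y_i - \bar{y})/s$ and the fact that $\sum y_i = n\bar{y}$:

```latex
\sum_{i=1}^{n} z_i
  = \sum_{i=1}^{n} \frac{y_i - \bar{y}}{s}
  = \frac{1}{s}\left(\sum_{i=1}^{n} y_i - n\bar{y}\right)
  = \frac{1}{s}\left(n\bar{y} - n\bar{y}\right)
  = 0
```

The deviations from the mean always sum to zero, and dividing every deviation by the same s does not change that.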

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 i   position   years
 1      1        0.6
 2      2        1.2
 3      3        1.6
 4      4        1.9
 5      5        1.5
 6      6        2.1
 7      7        2.3
 8      6        2.3
 9      5        2.5
10      4        2.8
11      3        2.9
12      2        3.3
13      1        3.4
14      2        3.6
15      3        3.7
16      4        3.8
17      5        3.9
18      6        4.1
19      7        4.2
20      6        4.5
21      5        4.7
22      4        4.9
23      3        5.3
24      2        5.6
25      1        6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: Q1, M, and Q3 mark off four pieces of the sorted data, each containing 1/4 of the values]

Quartiles are common measures of spread

  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$$

Denoted by the lower case Greek letter σ (sigma)

N is the population size (for example, N = 34,000 for NCSU)

μ is the population mean

σ = population standard deviation

The value of σ is typically not known; use s to estimate the value of σ
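As a rough illustration of the two formulas, the sketch below computes σ (dividing by N) and s (dividing by n - 1) for the same values. The eight numbers are invented for illustration and do not come from the slides; `statistics.pstdev` and `statistics.stdev` are functions from the Python standard library.

```python
import statistics

# Hypothetical data, invented for illustration (not from the slides).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: sum of squared deviations divided by N.
sigma = statistics.pstdev(values)

# Sample standard deviation: same sum divided by n - 1 instead.
s = statistics.stdev(values)

print(f"sigma = {sigma:.3f}, s = {s:.3f}")
```

For the same data s is slightly larger than σ because it divides by the smaller number n - 1.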

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont)

2. Note that s and σ are always greater than or equal to zero.

3. The larger the value of s (or σ), the greater the spread of the data.

When does s = 0? When does σ = 0?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: s² is the sample variance, σ² is the population variance. Units are squared units of the original data (square $, square gallons).

Review: Properties of s and σ

s and σ are always greater than or equal to 0 (when does s = 0? σ = 0?)

The larger the value of s (or σ), the greater the spread of the data

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE                          POPULATION
ȳ    sample mean                μ    population mean
m    sample median              m    population median
s²   sample variance            σ²   population variance
s    sample stand dev           σ    population stand dev

Section 3.3 (cont)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: Mean and Standard Deviation (numerical) combined with Histogram (graphical) lead to the 68-95-99.7 rule]

The 68-95-99.7 rule
If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 stan dev of the mean

[Figure: bell-shaped curve; 68% of the area lies between ȳ - s and ȳ + s, 34% on each side of ȳ]

68-95-99.7 rule: 95% within 2 stan dev of the mean

[Figure: bell-shaped curve; 95% of the area lies between ȳ - 2s and ȳ + 2s, 47.5% on each side of ȳ]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
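The three interval counts in the textbook-cost example can be checked with a short script. This is a minimal sketch using only the Python standard library; the helper name `share_within` is made up for illustration.

```python
import statistics

# The 50 textbook costs listed on the slide.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

ybar = statistics.mean(costs)   # sample mean
s = statistics.stdev(costs)     # sample standard deviation (n - 1 divisor)

def share_within(k):
    """Return (count, percent) of values within k standard deviations of the mean."""
    lo, hi = ybar - k * s, ybar + k * s
    count = sum(lo <= c <= hi for c in costs)
    return count, 100 * count / len(costs)

for k in (1, 2, 3):
    count, pct = share_within(k)
    print(f"within {k} sd: {count} of {len(costs)} = {pct:.0f}%")
```

The observed percentages are close to, but not exactly, 68, 95 and 99.7 because the rule is only an approximation for roughly bell-shaped data.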

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1) 10

2) 15

3) 20

4) 40

Section 3.3 (cont)
Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$
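The exam comparison reduces to one small function. A minimal sketch follows; the function name `z_score` and the variable names are illustrative choices, not anything defined in the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

z_exam1 = z_score(91, 88, 6)    # exam 1: mean 88, standard deviation 6
z_exam2 = z_score(92, 88, 10)   # exam 2: mean 88, standard deviation 10

print(z_exam1, z_exam2)
```

The larger z-score marks the score that stands further above its own class mean, which is why the lower raw score on exam 1 is the better performance.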

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better

Z-scores add to zero
Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4        1.79
UVA           13.1       4.0        1.12
Louisville    10.9       1.8        0.50
UNC            9.2       0.1        0.03
VaTech         7.9      -1.2       -0.34
FSU            7.9      -1.2       -0.34
GaTech         7.1      -2.0       -0.56
NCSU           6.5      -2.6       -0.73
Clemson        3.8      -5.3       -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
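The zero sums in the table can be reproduced directly from the nine support figures. A minimal sketch using the Python standard library; the variable names are illustrative.

```python
import statistics

# Support ($ millions) for the 9 public ACC schools, in table order.
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

ybar = statistics.mean(support)
s = statistics.stdev(support)

# Standardize every value: z = (y - ybar) / s
z = [(y - ybar) / s for y in support]

print(round(ybar, 4), round(s, 4), round(sum(z), 10))
```

The deviations from the mean always cancel, so dividing each one by the same s leaves a sum of zero (up to floating-point rounding).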

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03

2) -1.03

3) 2.39

4) 1865

5) -1865

Section 3.4
Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Rank  Pos. in half  Value
  1        1         0.6
  2        2         1.2
  3        3         1.6
  4        4         1.9
  5        5         1.5
  6        6         2.1
  7        7         2.3   (Q1)
  8        6         2.3
  9        5         2.5
 10        4         2.8
 11        3         2.9
 12        2         3.3
 13        1         3.4   (m)
 14        2         3.6
 15        3         3.7
 16        4         3.8
 17        5         3.9
 18        6         4.1
 19        7         4.2   (Q3)
 20        6         4.5
 21        5         4.7
 22        4         4.9
 23        3         5.3
 24        2         5.6
 25        1         6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data)

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data)

Quartiles and median divide data into 4 pieces

[Diagram: Q1, M and Q3 split the sorted data into four pieces, each holding 1/4 of the values]

Quartiles are common measures of spread

http://oirp.ncsu.edu/ir/admit

http://oirp.ncsu.edu/univ/peer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves
when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
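The two-step rule can be written as a short function. This is a sketch of the rule exactly as stated on the slide (overall median included in both halves when n is odd); the function name `quartiles` is an illustrative choice, and statistical software often uses other quartile conventions that can give slightly different answers.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = values[:half], values[half:]
    return (statistics.median(lower),
            statistics.median(values),
            statistics.median(upper))

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For the ten-value example this returns Q1 = 6, median = 11 and Q3 = 16, matching the hand calculation.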

Pulse Rates, n = 138

  #   Stem  Leaves
        4
  3     4   588
  9     5   001233444
 10     5   5556788899
 23     6   00011111122233333344444
 23     6   55556666667777788888888
 16     7   0000011222233444
 23     7   55555666666777888888999
 10     8   0000112224
 10     8   5555667789
  4     9   0012
  2     9   58
  4    10   0223
       10
  1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
 (4)    28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
   2    22  | 55
   4    23  | 57
   6    24  | 26
   7    25  | 7
  10    26  | 257
  12    27  | 59
 (4)    28  | 1567
  15    29  | 35599
  10    30  | 333
   7    31  | 45
   5    32  | 155
   2    33  | 6
   1    34  | 0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45  63  70  78  111

Disease X (years until death): five-number summary

Smallest = min = 0.6
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Largest = max = 6.1

[Figure: the 25 sorted values listed beside a boxplot of Disease X; vertical axis: years until death, 0 to 7; the box marks Q1, m and Q3, and the whiskers reach min and max]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13  17  19  22  47

[Figure: boxplot of the ages]

Boxplot display of 5-number summary

Q1 = first quartile = 2.3
Q3 = third quartile = 4.2
Largest = max = 7.9

[Figure: the 25 sorted values, now with largest value 7.9 (individual 25) and second largest 6.1, beside a boxplot of Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
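The outlier check for the Disease X data can be sketched as follows. The list is the 25 values read from the slide (largest value 7.9), and the quartiles use the slide's rule of including the overall median in both halves when n is odd; the variable names are illustrative.

```python
import statistics

# Years until death for the 25 individuals, as read from the slide.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

values = sorted(years)
half = len(values) // 2

# n = 25 is odd, so the overall median is included in both halves.
q1 = statistics.median(values[:half + 1])
q3 = statistics.median(values[half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are flagged as outliers.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(v for v in values if v <= upper_fence)

print(q1, q3, round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```

Only 7.9 lies above the upper fence of 7.05, so the upper whisker ends at 6.1.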

ATM Withdrawals by Day, Month, Holidays

Beg of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulses with labeled values 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition, 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival

Alive:  710/2201 = 32.3%
Dead:  1491/2201 = 67.7%

marg. dist. of class

Crew:   885/2201 = 40.2%
First:  325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third:  706/2201 = 32.1%
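The marginal distributions come from the row and column totals of the table. A minimal sketch in plain Python, using the counts from the slide; the dictionary layout and names are illustrative choices.

```python
# Titanic counts from the contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals over the grand total.
survival_pct = {status: round(100 * sum(row.values()) / total, 1)
                for status, row in counts.items()}

# Marginal distribution of class: column totals over the grand total.
class_pct = {c: round(100 * sum(row[c] for row in counts.values()) / total, 1)
             for c in classes}

print(total, survival_pct, class_pct)
```

Each marginal distribution ignores the other variable entirely, which is why it uses only the totals in the table margins.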

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                         Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
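The three questions share the numerator 202 and differ only in the denominator. The sketch below makes that explicit; the variable names are illustrative.

```python
# Titanic counts by class, from the contingency table.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())                # all survivors
everyone = survivors + sum(dead.values())      # everyone on board
first_class = alive["First"] + dead["First"]   # all first-class passengers

# Same numerator, three different denominators.
of_survivors = alive["First"] / survivors      # survivors who were in first class
of_everyone = alive["First"] / everyone        # in first class AND survived
of_first_class = alive["First"] / first_class  # first-class passengers who survived

for label, value in [("of survivors", of_survivors),
                     ("of everyone", of_everyone),
                     ("of first class", of_first_class)]:
    print(f"{label}: {100 * value:.1f}%")
```

Choosing the denominator is the whole question: it identifies the group being conditioned on.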

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 8.0%

2) 23.5%

3) 58.2%

4) 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 41.8%

2) 38.8%

3) 51.2%

4) 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%

2) 48.8%

3) 26.8%

4) 27.7%

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students in the Student / Beers / BAC table]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

For bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

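Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: weight (1000 lbs), 1.5 to 4.5; vertical axis: fuel consumption (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$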
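The value of r for the fuel consumption data can be reproduced by applying the formula term by term. A minimal sketch using the ten (weight, fuel) pairs from the scatterplot slide; the function name `correlation` is an illustrative choice.

```python
import statistics

# The ten cars from the slide: weight in 1000 lbs, fuel use in gal/100 miles.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(x, y):
    """r = 1/(n-1) times the sum of products of standardized x and y values."""
    n = len(x)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Each term in the sum is a product of two z-scores, so cars that are above average on both variables (or below average on both) push r toward +1.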

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = 0.935

End of Chapter 3

Page 78: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Remarks

1 The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont)

2 Note that s and s are always greater than or equal to zero

3 The larger the value of s (or s ) the greater the spread of the data

When does s=0 When does s =0

When all data values are the same

Remarks (cont)4 The standard deviation is the most

commonly used measure of risk in finance and businessndash Stocks Mutual Funds etc

5 Variance s2 sample variance 2 population variance Units are squared units of the original data square $ square gallons

Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1     M     Q3
 1/4     1/4   1/4    1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
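The steps above can be sketched in code. This follows the slide's convention (include the overall median in both halves when n is odd); statistical software often uses other quartile conventions, so results from a library call may differ slightly. The function names `median` and `quartiles` are illustrative.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using the slide's rule for splitting the data."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = values[:half + 1], values[half:]
    else:
        # n even: the halves do not overlap.
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten-value example this reproduces Q1 = 6, m = 11, and Q3 = 16.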

Pulse Rates n = 138

      Stem  Leaves
        4
   3    4   588
   9    5   001233444
  10    5   5556788899
  23    6   00011111122233333344444
  23    6   55556666667777788888888
  16    7   00000112222334444
  23    7   55555666666777888888999
  10    8   0000112224
  10    8   5555667789
   4    9   0012
   2    9   58
   4   10   0223
       10
   1   11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data (years until death, individuals 25 down to 1):
6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

[Figure: boxplot of years until death for Disease X; vertical axis from 0 to 7 years]

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data (years until death, individuals 25 down to 1):
7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot: display of 5-number summary

[Figure: boxplot of years until death for Disease X, with the value 7.9 plotted separately; vertical axis from 0 to 8 years]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is larger than 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
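The 1.5 × IQR check above is easy to automate. The sketch below assumes the quartiles are already known (Q1 = 2.3, Q3 = 4.2, as on the slide); the names `outlier_fences`, `years`, and `whisker_top` are illustrative.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Years until death, with the largest value 7.9 as on the slide.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = outlier_fences(q1=2.3, q3=4.2)
outliers = [y for y in years if y < low or y > high]

# The upper whisker stops at the largest observation that is not an outlier.
whisker_top = max(y for y in years if y <= high)

print(outliers, whisker_top)
```

The same function applied to the pulse data, `outlier_fences(63, 78)`, gives the fences 40.5 and 100.5.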

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
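The marginal distributions come from the row and column totals of the table. A minimal sketch, with the nested dictionary `counts` as an assumed layout for the Titanic counts:

```python
# Counts from the Titanic table: rows are survival status, columns are class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    cls: sum(row[cls] for row in counts.values()) / grand_total
    for cls in counts["Alive"]
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```

Each marginal distribution ignores the other variable entirely, which is why its percentages add to 100%.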

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                            Class
Survival             Crew  First  Second  Third  Total
Alive  Count          212    202     118    178    710
       % of col      24.0   62.2    41.4   25.2   32.3
Dead   Count          673    123     167    528   1491
       % of col      76.0   37.8    58.6   74.8   67.7
Total  Count          885    325     285    706   2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                            Class
Survival             Crew  First  Second  Third  Total
Alive  Count          212    202     118    178    710
       % of col      24.0   62.2    41.4   25.2   32.3
Dead   Count          673    123     167    528   1491
       % of col      76.0   37.8    58.6   74.8   67.7
Total  Count          885    325     285    706   2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
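The three questions divide the same count (202 first-class survivors) by three different totals. The sketch below makes the denominators explicit and also recomputes the "% of col" rows of the table; the variable names are illustrative.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())                   # row total for Alive
everyone = survivors + sum(dead.values())         # grand total
first_class = alive["First"] + dead["First"]      # column total for First

# Same numerator, three different denominators.
share_of_survivors_in_first = alive["First"] / survivors      # 202/710
share_first_and_survived = alive["First"] / everyone          # 202/2201
survival_rate_in_first = alive["First"] / first_class         # 202/325

# Conditional distribution of survival given class (the "% of col" rows).
survival_rate_by_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

for cls, rate in survival_rate_by_class.items():
    print(f"{cls}: {100 * rate:.1f}% survived")
```

Reading the question carefully for the phrase after "of" (of survivors, of passengers, of first class passengers) identifies which total belongs in the denominator.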

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
   1       5    0.1
   2       2    0.03
   3       9    0.19
   4       7    0.095
   5       3    0.07
   6       3    0.02
   7       4    0.07
   8       5    0.085
   9       8    0.12
  10       3    0.04
  11       5    0.06
  12       5    0.05
  13       6    0.1
  14       7    0.09
  15       1    0.01
  16       4    0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Student  Beers  BAC
   1       5    0.1
   2       2    0.03
   3       9    0.19
   4       7    0.095
   5       3    0.07
   6       3    0.02
   7       4    0.07
   8       5    0.085
   9       8    0.12
  10       3    0.04
  11       5    0.06
  12       5    0.05
  13       6    0.1
  14       7    0.09
  15       1    0.01
  16       4    0.05

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5, vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

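Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs) from 1.5 to 4.5, vertical axis FUEL CONSUMP. (gal/100 miles) from 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$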
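The formula for r is an average of products of standardized values, which the following sketch computes directly with the standard `statistics` module. The function name `correlation` and the list names are illustrative; the data are the ten (weight, fuel consumption) pairs from the fuel consumption slide.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of (z-score of x_i) * (z-score of y_i)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Swapping the roles of x and y leaves r unchanged, since each product in the sum is symmetric in the two variables.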

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 79: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Remarks (cont)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance: $s^2$ = sample variance, $\sigma^2$ = population variance. Units are squared units of the original data: square dollars, square gallons.

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0.

When does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.
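These properties can be checked numerically. The sketch below uses Python's standard `statistics` module, where `stdev` divides by n - 1 (the sample formula for s) and `pstdev` divides by n (the population formula for σ). The data values are hypothetical, chosen only so the arithmetic is easy to follow; applying both formulas to the same list is purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical data, mean = 5

s = stdev(values)          # sample standard deviation: divides by n - 1
sigma = pstdev(values)     # population standard deviation: divides by n

# Variance is the square of the standard deviation (in squared units).
s_squared = variance(values)
sigma_squared = pvariance(values)

# When all data values are the same, there is no spread at all.
constant = [7, 7, 7, 7]
print(s, sigma, stdev(constant), pstdev(constant))
```

Both measures are never negative, and the sample version is slightly larger than the population version for the same data because it divides by n - 1 rather than n.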

Summary of Notation

SAMPLE

$\bar{y}$    sample mean

$m$    sample median

$s^2$    sample variance

$s$    sample stand. dev.

POPULATION

$\mu$    population mean

population median

$\sigma^2$    population variance

$\sigma$    population stand. dev.

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

[Diagram: the 68-95-99.7 rule connects the mean and standard deviation (numerical) with the histogram (graphical)]

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

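68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]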

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean:

$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
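The three counts in the textbook example can be verified directly from the 50 data values. This is a minimal sketch using the standard `statistics` module; the helper name `share_within` is illustrative.

```python
from statistics import mean, stdev

costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

y_bar = mean(costs)
s = stdev(costs)                      # sample standard deviation

def share_within(k):
    """Fraction of the data inside (y_bar - k*s, y_bar + k*s)."""
    low, high = y_bar - k * s, y_bar + k * s
    inside = sum(low <= c <= high for c in costs)
    return inside / len(costs)

for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"within {k} sd: {100 * share_within(k):.0f}% (rule: {rule})")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the rule's values, which is expected because the rule is an approximation for roughly bell-shaped data.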

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y

$z = \frac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$


  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 80: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Remarks (cont)

4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).

5. Variance: \(s^2\) is the sample variance; \(\sigma^2\) is the population variance. Units are squared units of the original data (square dollars, square gallons).

Review: Properties of \(s\) and \(\sigma\)

\(s\) and \(\sigma\) are always greater than or equal to 0.

When does \(s = 0\)? When does \(\sigma = 0\)?

The larger the value of \(s\) (or \(\sigma\)), the greater the spread of the data.

The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation

SAMPLE

sample mean: \(\bar{y}\)

sample median: \(m\)

sample variance: \(s^2\)

sample stand. dev.: \(s\)

POPULATION

population mean: \(\mu\)

population median

population variance: \(\sigma^2\)

population stand. dev.: \(\sigma\)

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in \((\bar{y} - s,\ \bar{y} + s)\)

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in \((\bar{y} - 2s,\ \bar{y} + 2s)\)

3) almost all the measurements are within 3 standard deviations of the mean, that is, in \((\bar{y} - 3s,\ \bar{y} + 3s)\)

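68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between \(\bar{y} - s\) and \(\bar{y} + s\), 34% on each side of \(\bar{y}\).]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between \(\bar{y} - 2s\) and \(\bar{y} + 2s\), 47.5% on each side of \(\bar{y}\).]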

Example: textbook costs

\(\bar{y} = 375.48\), \(s = 42.72\), \(n = 50\)

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont)

1 standard deviation interval about the mean: \((\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)\)

Percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean: \((\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)\)

Percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean: \((\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)\)

Percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
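
The interval counts in the textbook-cost example can be checked with a short script. This is only a sketch: the function name `coverage` is made up for illustration, and the mean and standard deviation are taken as stated on the slide rather than recomputed.

```python
# Textbook costs from the example above (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]
Y_BAR = 375.48  # sample mean, as stated on the slide
S = 42.72       # sample standard deviation, as stated on the slide


def coverage(data, center, spread, k):
    """Return (count, percent) of values within k * spread of center."""
    lo, hi = center - k * spread, center + k * spread
    count = sum(lo <= v <= hi for v in data)
    return count, 100 * count / len(data)


# Compare the observed percentages with the 68-95-99.7 rule.
results = {k: coverage(costs, Y_BAR, S, k) for k in (1, 2, 3)}
for k, (count, pct) in results.items():
    print(f"within {k} s of the mean: {count}/{len(costs)} = {pct:.0f}%")
```

The observed percentages (64%, 96%, 100%) are close to, but not exactly, 68%, 95% and 99.7%, which is what "approximately" in the rule allows for.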

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

\[ z = \frac{y - \bar{y}}{s} \]

where

\(y\) = original data value

\(\bar{y}\) = the sample mean

\(s\) = the sample standard deviation

\(z\) = the z-score corresponding to \(y\)

Exam 1: \(\bar{y}_1 = 88\), \(s_1 = 6\); your exam 1 score: 91

Exam 2: \(\bar{y}_2 = 88\), \(s_2 = 10\); your exam 2 score: 92

Which score is better?

\[ z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5 \qquad z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4 \]

91 on exam 1 is better than 92 on exam 2.

If data has mean \(\bar{y}\) and standard deviation \(s\), then standardizing a particular value of \(y\) indicates how many standard deviations \(y\) is above or below the mean \(\bar{y}\).

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
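
The exam and SAT/ACT comparisons above both reduce to the same one-line calculation. A minimal sketch follows; the helper name `z_score` is an assumption for illustration, not something from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations y lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (y - mean) / sd


# Exam example: same mean, different spreads.
exam1 = z_score(91, 88, 6)    # 3/6
exam2 = z_score(92, 88, 10)   # 4/10

# SAT vs ACT: different scales become comparable after standardizing.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(f"exam 1: z = {exam1:.1f}, exam 2: z = {exam2:.1f}")
print(f"Eleanor: z = {eleanor:.1f}, Gerald: z = {gerald:.1f}")
```

The larger z-score marks the better relative standing, which is why 91 on exam 1 beats 92 on exam 2 and Eleanor's 680 beats Gerald's 27.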

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4        1.79
UVA           13.1       4.0        1.12
Louisville    10.9       1.8        0.50
UNC            9.2       0.1        0.03
VaTech         7.9      -1.2       -0.34
FSU            7.9      -1.2       -0.34
GaTech         7.1      -2.0       -0.56
NCSU           6.5      -2.6       -0.73
Clemson        3.8      -5.3       -1.47
Sum                      0          0

Mean = 9.1000, s = 3.5697

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data (observation number, position counted in from the ends of each half, value):

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, Q3

1/4, 1/4, 1/4, 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important:

when n is odd, include the overall median in both halves;

when n is even, do not include the overall median in either half.

Example

2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
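
The quartile rules above translate into a few lines of code. The sketch below follows the median-of-halves convention stated on the slide (overall median included in both halves when n is odd); the function name `quartiles` is illustrative, and statistical software often uses other interpolation conventions, so results may differ slightly from packaged routines.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = ys[:half + 1], ys[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = ys[:half], ys[half:]
    return median(lower), median(ys), median(upper)


example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}")
```

For the ten-value example this reproduces Q1 = 6, m = 11 and Q3 = 16.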

Pulse Rates, n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Disease X: years until death (25 observations)

Smallest = min = 0.6

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

Largest = max = 6.1

Five-number summary: min, Q1, m, Q3, max
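
The five-number summary for the Disease X data can be computed as follows. This is a sketch under two assumptions: the 25 values are transcribed from the Disease X table earlier in this section, and quartiles follow the deck's median-of-halves rule. The function name `five_number_summary` is illustrative.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    # When n is odd, the overall median is included in both halves.
    lower = ys[:half + 1] if n % 2 else ys[:half]
    upper = ys[half:]
    return ys[0], median(lower), median(ys), median(upper), ys[-1]


# Years until death, Disease X (n = 25).
disease_x = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

summary = five_number_summary(disease_x)
print("min, Q1, m, Q3, max =", summary)
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, with a line at the median.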

Boxplot display of 5-number summary

[Figure: boxplot]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot display of 5-number summary: Disease X, years until death (25 observations)

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual #25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulses with marked values 40.5, 45, 63, 70, 78, 100.5]
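
The 1.5 × IQR calculation used in the Disease X and pulse-rate examples can be sketched as below. The function names are hypothetical, and only the quartiles quoted on the slides are used as inputs.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return value < low or value > high


# Beginning-of-class pulses: Q1 = 63, Q3 = 78.
pulse_low, pulse_high = outlier_fences(63, 78)
print(f"pulse fences: ({pulse_low}, {pulse_high})")

# Disease X: Q1 = 2.3, Q3 = 4.2; is 7.9 years an outlier?
dx_low, dx_high = outlier_fences(2.3, 4.2)
print(f"Disease X upper fence: {dx_high:.2f}, 7.9 outlier: {is_outlier(7.9, 2.3, 4.2)}")
```

A whisker is then drawn only as far as the most extreme data value that is still inside the fences, and anything beyond is plotted individually.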

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis from 0 to 1369 with ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic table of counts and column percentages):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
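
The marginal and conditional distributions for the Titanic table, and the three fractions just listed, can be reproduced from the raw counts. The sketch below uses plain dictionaries; all variable names are illustrative.

```python
# Counts from the Titanic contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {c: sum(counts[row][c] for row in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals over the grand total.
marginal_survival = {row: t / grand_total for row, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: cell over column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions about first class.
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]  # 202/710
first_and_survived = counts["Alive"]["First"] / grand_total             # 202/2201
survived_given_first = survival_given_class["First"]                    # 202/325

for c in classes:
    print(f"{c:<7} share of people: {marginal_class[c]:.1%}, "
          f"survived: {survival_given_class[c]:.1%}")
```

The three fractions share the numerator 202 and differ only in the denominator, which is the whole point of distinguishing a conditional from a joint proportion.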

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)

Scatterplot: Blood Alcohol Content vs Number of Beers

[Figure: scatterplot of BAC against number of beers for the 16 students in the Student / Beers / BAC table]

In a scatterplot one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

\((x_i, y_i)\): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]

bivariate data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)

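Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]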
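
The value r = .9766 can be reproduced by applying the formula for r to the ten (weight, fuel consumption) pairs from the scatterplot slide. The sketch below writes the formula out term by term rather than calling a library routine; the function name `correlation` is illustrative.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of z-scores, dividing by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = ((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is the product of an x z-score and a y z-score, so points that are above (or below) average on both variables push r toward +1.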

Properties

r ranges from -1 to +1.

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Review Properties of s and s s and s are always greater than or

equal to 0

when does s = 0 s = 0 The larger the value of s (or s) the

greater the spread of the data the standard deviation of a set of

measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

2

SAMPLE

sample mean

sample median

sample variance

sample stand dev

y

m

s

s

2

POPULATION

population mean

population median

population variance

population stand dev

m

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1

Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
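The quartile rules above can be sketched in Python as follows. The function name `quartiles` is an assumption of this sketch. Note that library routines such as `statistics.quantiles` use different interpolation conventions and may return slightly different values than the median-of-halves rule used here.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    if n % 2 == 1:
        lower = values[:half + 1]   # includes the middle value
        upper = values[half:]       # includes the middle value
    else:
        lower = values[:half]
        upper = values[half:]
    return median(lower), median(values), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```

For the ten-value example this reproduces Q1 = 6, median = 11, and Q3 = 16.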

Pulse Rates n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaves
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
 (4)    28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaves
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
 (4)    28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45 63 70 78 111

[Figure: Disease X, years until death (axis 0 to 7), for the 25 observations listed from largest to smallest, with Smallest = min = 0.6, Q1 = first quartile = 2.3, m = median = 3.4, Q3 = third quartile = 4.2, and Largest = max = 6.1 marked]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Boxplot: display of 5-number summary

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Data, largest first (n = 25):

7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot of Disease X, years until death (axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
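The 1.5 × IQR rule for flagging outliers can be sketched as below. The helper names `fences` and `split_outliers` are assumptions of this sketch, and Q1 and Q3 are taken as given from the slide rather than recomputed.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(data, q1, q3):
    """Separate values inside the fences from values beyond them."""
    low, high = fences(q1, q3)
    inside = [v for v in data if low <= v <= high]
    outliers = [v for v in data if v < low or v > high]
    return inside, outliers


# Disease X data (years until death) with the largest value 7.9.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

inside, outliers = split_outliers(years, q1=2.3, q3=4.2)
print("upper fence:", round(fences(2.3, 4.2)[1], 2))
print("outliers:", outliers)
print("upper whisker ends at:", max(inside))
```

The whisker stops at the largest observation that is not an outlier, which is why it ends at 6.1 rather than at the fence itself.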

ATM Withdrawals by Day Month Holidays

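Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with 45, 63, 70, and 78 marked, along with the fences 40.5 and 100.5]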

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew  First  Second  Third  Total
Alive     212    202     118    178    710
Dead      673    123     167    528   1491
Total     885    325     285    706   2201

Marginal distributions

marg. dist. of survival:
Alive  710/2201 = 32.3%
Dead  1491/2201 = 67.7%

marg. dist. of class:
Crew    885/2201 = 40.2%
First   325/2201 = 14.8%
Second  285/2201 = 12.9%
Third   706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart
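As a rough sketch, the marginal distributions can be computed from the cell counts alone. The nested-dictionary layout and the function name `marginal_distributions` are assumptions made for this illustration.

```python
# Cell counts from the Titanic table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def marginal_distributions(table):
    """Return (row marginals, column marginals) as proportions of the grand total."""
    total = sum(sum(row.values()) for row in table.values())
    row_marg = {r: sum(row.values()) / total for r, row in table.items()}
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    col_marg = {c: n / total for c, n in col_totals.items()}
    return row_marg, col_marg


survival, ship_class = marginal_distributions(counts)
for name, p in {**survival, **ship_class}.items():
    print(f"{name:<7}{p:6.1%}")
```

Each marginal distribution uses only the totals in the margins of the table, so it describes one variable while ignoring the other.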

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                                Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201

Conditional distributions: segmented bar chart
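The "% of col" rows are conditional distributions of survival given class. A minimal Python sketch, assuming the same nested-dictionary layout for the counts (the name `column_percentages` is hypothetical):

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def column_percentages(table):
    """For each column, return the percent of that column falling in each row."""
    columns = next(iter(table.values())).keys()
    result = {}
    for col in columns:
        col_total = sum(row[col] for row in table.values())
        result[col] = {r: 100 * row[col] / col_total for r, row in table.items()}
    return result


cond = column_percentages(counts)
for col, dist in cond.items():
    print(f"{col:<7} alive {dist['Alive']:5.1f}%   dead {dist['Dead']:5.1f}%")
```

Each column is divided by its own total, so the percentages within a class add to 100% and the classes can be compared directly.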

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                                Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201
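The three questions share the numerator 202 and differ only in the denominator. A small Python sketch of that idea (variable names are assumptions of this illustration):

```python
alive_first = 202      # first-class passengers who survived
survivors = 710        # everyone who survived
everyone = 2201        # all people on board
first_class = 325      # all first-class passengers

fractions = {
    "of survivors, in first class": alive_first / survivors,
    "of all on board, first class and survived": alive_first / everyone,
    "of first class, survived": alive_first / first_class,
}

for question, value in fractions.items():
    print(f"{question}: {value:.3f}")
```

The first and third are conditional fractions (the group being described is restricted first), while the second is a joint fraction out of the whole table.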

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
    1      5       0.1
    2      2       0.03
    3      9       0.19
    4      7       0.095
    5      3       0.07
    6      3       0.02
    7      4       0.07
    8      5       0.085
    9      8       0.12
   10      3       0.04
   11      5       0.06
   12      5       0.05
   13      6       0.1
   14      7       0.09
   15      1       0.01
   16      4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

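Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$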
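A minimal Python sketch of the formula r = 1/(n - 1) × Σ [(x_i - x̄)/s_x][(y_i - ȳ)/s_y], applied to the ten (car weight, fuel consumption) pairs from the scatterplot slide. The function name `correlation` is an assumption of this sketch; `statistics.stdev` supplies the sample standard deviations.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of z-scores, with divisor n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in kilograms or fuel consumption in liters.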

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 3.1 Displaying Categorical Data
  • The three rules of data analysis won't be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 3.1 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 20.2 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histogram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 3.2 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 3.3 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations …
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 3.3 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-99.7 rule
  • The 68-95-99.7 rule If the histogram of the data is approximat
  • 68-95-99.7 rule 68% within 1 stan dev of the mean
  • 68-95-99.7 rule 95% within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 3.3 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 3.4 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 3.5 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 3.5 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to +1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 82: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

Summary of Notation

SAMPLE

$\bar{y}$: sample mean

$m$: sample median

$s^2$: sample variance

$s$: sample stand. dev.

POPULATION

$\mu$: population mean

population median

$\sigma^2$: population variance

$\sigma$: population stand. dev.

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

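68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve with 68% of the area between y-s and y+s (34% on each side of the mean y)]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve with 95% of the area between y-2s and y+2s (47.5% on each side of the mean y)]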

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
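The textbook-cost check of the 68-95-99.7 rule can be reproduced with a short Python sketch. The function name `count_within` is an assumption of this illustration, and the mean and standard deviation are taken as stated on the slide (375.48 and 42.72) rather than recomputed.

```python
def count_within(data, center, spread, k):
    """Count the values lying within k standard deviations of the mean."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in data if low <= v <= high)


costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]

ybar, s = 375.48, 42.72   # values stated on the slide

for k, rule in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    inside = count_within(costs, ybar, s, k)
    share = inside / len(costs)
    print(f"within {k} sd: {inside}/{len(costs)} = {share:.0%} (rule: {rule})")
```

The observed shares are close to, but not exactly, the rule's percentages, which is expected since the rule is only approximate for roughly bell-shaped data.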

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule) (preceding slides)

z-scores (next)

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = .5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = .4$$

91 on exam 1 is better than 92 on exam 2

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 83: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule)

z-scores

68-95-99.7 rule

Mean and Standard Deviation (numerical)

Histogram (graphical)

68-95-99.7 rule

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s,\ \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s,\ \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s,\ \bar{y} + 3s)$

68-95-99.7 rule: 68% within 1 standard deviation of the mean

[Figure: bell-shaped curve with 68% of the area between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$.]

68-95-99.7 rule: 95% within 2 standard deviations of the mean

[Figure: bell-shaped curve with 95% of the area between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$.]

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

(Same 50 textbook costs as on the first textbook-costs slide.)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean: $(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$

Percentage of data values in this interval: 32/50 = 64%

68-95-99.7 rule: 68%

Example: textbook costs (cont.)

(Same 50 textbook costs as on the first textbook-costs slide.)

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean: $(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$

Percentage of data values in this interval: 48/50 = 96%

68-95-99.7 rule: 95%

Example: textbook costs (cont.)

(Same 50 textbook costs as on the first textbook-costs slide.)

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean: $(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

Percentage of data values in this interval: 50/50 = 100%

68-95-99.7 rule: 99.7%
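
The interval counts in the textbook-cost example can be checked with a few lines of Python. This is only a sketch using the standard library; the helper name `share_within` is made up for illustration and is not part of the slides.

```python
from statistics import mean, stdev

# The 50 textbook costs from the example.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

def share_within(data, k):
    """Fraction of values inside (mean - k*s, mean + k*s)."""
    m, s = mean(data), stdev(data)  # stdev uses the n - 1 divisor
    low, high = m - k * s, m + k * s
    return sum(low < y < high for y in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(costs, k):.0%}")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68-95-99.7 benchmarks, which is what one expects for roughly bell-shaped data.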

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
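
The two comparisons above (exam 1 vs exam 2, SAT vs ACT) use the same calculation, which can be sketched as a one-line function. The function name `z_score` and the variable names are illustrative only.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Exam example: same mean, different spreads.
exam1 = z_score(91, 88, 6)
exam2 = z_score(92, 88, 10)

# SAT vs ACT example: different scales made comparable.
eleanor = z_score(680, 500, 100)
gerald = z_score(27, 18, 6)

print(f"exam 1: {exam1:.1f}, exam 2: {exam2:.1f}")
print(f"Eleanor: {eleanor:.1f}, Gerald: {gerald:.1f}")
```

In both cases the larger z-score marks the better relative performance, even when the raw score is smaller (91 vs 92) or measured on a different scale (680 vs 27).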

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
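
The zero-sum property follows from the definition: the deviations $y - \bar{y}$ always add to zero, and dividing each by the same $s$ does not change that. A small sketch using the rounded support figures from the table (so individual z-scores may differ from the slide in the last digit):

```python
from statistics import mean, stdev

# Support in $ millions, as listed in the table (rounded to one decimal).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9,
    "UNC": 9.2, "VaTech": 7.9, "FSU": 7.9,
    "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
ybar = mean(values)
s = stdev(values)  # sample standard deviation (n - 1 divisor)

deviations = {school: y - ybar for school, y in support.items()}
z_scores = {school: d / s for school, d in deviations.items()}

print(f"mean = {ybar:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(deviations.values()):.10f}")
print(f"sum of z-scores   = {sum(z_scores.values()):.10f}")
```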

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

      Q1     M     Q3
 1/4  |  1/4 | 1/4 |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
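
The quartile rule above can be sketched in Python as follows. Software packages use several different quartile conventions, so this function (the name `quartiles` is illustrative) implements only the rule stated on the slide and may not match a given package's default.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the rule on the slide:
    n odd  -> the overall median is included in both halves;
    n even -> the overall median is in neither half."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    if n % 2 == 1:
        lower, upper = ys[:half + 1], ys[half:]
    else:
        lower, upper = ys[:half], ys[half:]
    return median(lower), median(ys), median(upper)

# The slide's example: 2, 4, ..., 20 (n = 10).
q1, m, q3 = quartiles(range(2, 21, 2))
print(q1, m, q3)
```

For an odd-sized data set such as 1, 2, 3, 4, 5 the same function uses the halves 1 2 3 and 3 4 5, giving Q1 = 2 and Q3 = 4.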

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Values, largest to smallest: 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

[Figure: boxplot for Disease X; vertical axis: Years until death, 0 to 7.]

Five-number summary: min, Q1, m, Q3, max
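
As a check on the Disease X figures, the five-number summary can be computed directly from the 25 values. The sketch below applies the course's quartile rule (overall median included in both halves when n is odd); the function name `five_number_summary` is illustrative.

```python
from statistics import median

# Years until death for the 25 Disease X individuals, as listed on the slide.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max)."""
    ys = sorted(data)
    half = len(ys) // 2
    keep_median = len(ys) % 2          # 1 when n is odd, 0 when n is even
    lower = ys[:half + keep_median]    # lower half (with the median if n is odd)
    upper = ys[half:]                  # upper half (with the median if n is odd)
    return ys[0], median(lower), median(ys), median(upper), ys[-1]

print(five_number_summary(years))
```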

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Values, largest to smallest: 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.5 1.9 1.6 1.2 0.6

Boxplot display of 5-number summary

[Figure: BOXPLOT for Disease X; vertical axis: Years until death, 0 to 8.]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
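
The outlier check above can be sketched in code. The snippet recomputes Q1 and Q3 with the course's quartile rule, builds the 1.5 × IQR fences, and finds where the upper whisker ends; all function and variable names are illustrative.

```python
from statistics import median

# Disease X data with the largest value changed to 7.9, as on the slide.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

def q1_q3(data):
    """Q1 and Q3 as medians of the lower and upper halves
    (overall median included in both halves when n is odd)."""
    ys = sorted(data)
    half = len(ys) // 2
    lower = ys[:half + len(ys) % 2]
    upper = ys[half:]
    return median(lower), median(upper)

q1, q3 = q1_q3(years)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Points beyond a fence are flagged as outliers.
outliers = [y for y in years if y < low_fence or y > high_fence]
# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= high_fence)

print(f"IQR = {iqr:.2f}, upper fence = {high_fence:.2f}")
print("outliers:", outliers, "| whisker ends at", whisker_top)
```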

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data, marked at 40.5, 45, 63, 70, 78 and 100.5.]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis from 0 to 1369.]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
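
The marginal distributions are just the row and column totals divided by the grand total. A minimal sketch with the Titanic counts stored in a nested dictionary (the variable names are illustrative):

```python
# Titanic counts: survival status (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {status: sum(row.values()) / total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total.
classes = list(counts["Alive"])
class_marginal = {c: sum(counts[status][c] for status in counts) / total
                  for c in classes}

for name, p in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {p:.1%}")
```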

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                      Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
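
The "% of col" rows are conditional distributions: within each class, the count in a cell is divided by that class's column total rather than by the grand total. A short sketch (names are illustrative):

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Conditional distribution of survival given class: cell count / column total.
survival_given_class = {}
for c in alive:
    column_total = alive[c] + dead[c]
    survival_given_class[c] = {
        "Alive": alive[c] / column_total,
        "Dead": dead[c] / column_total,
    }

for c, dist in survival_given_class.items():
    print(f"{c}: alive {dist['Alive']:.1%}, dead {dist['Dead']:.1%}")
```

Each class's two percentages add to 100%, which is a quick way to confirm that the conditioning was done on the columns.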

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                      Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
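
The three questions share the numerator 202 (first-class survivors) and differ only in the denominator, that is, in which group is being conditioned on. A small sketch with the relevant totals from the table (variable names are illustrative):

```python
first_class_survivors = 202
all_survivors = 710
all_people = 2201
first_class_total = 325

# Of the survivors, what fraction were in first class?
first_given_alive = first_class_survivors / all_survivors
# Of everyone on board, what fraction were first-class survivors?
first_and_alive = first_class_survivors / all_people
# Of the first-class passengers, what fraction survived?
alive_given_first = first_class_survivors / first_class_total

print(f"{first_given_alive:.3f} {first_and_alive:.3f} {alive_given_first:.3f}")
```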

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
 1          5       0.10
 2          2       0.03
 3          9       0.19
 4          7       0.095
 5          3       0.07
 6          3       0.02
 7          4       0.07
 8          5       0.085
 9          8       0.12
10          3       0.04
11          5       0.06
12          5       0.05
13          6       0.10
14          7       0.09
15          1       0.01
16          4       0.05

Here we have two quantitative variables for each of 16 students:

1) how many beers they drank, and
2) their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), y-axis: FUEL CONSUMP. (gal/100 miles).]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

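Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), y-axis: FUEL CONSUMP. (gal/100 miles).]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$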
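
The value r = 0.9766 can be reproduced from the ten (weight, fuel consumption) pairs by applying the formula term by term: standardize each x and each y, multiply the pairs, add, and divide by n - 1. This is a sketch with illustrative names, using only the standard library.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """Sample correlation: average product of standardized values,
    using the n - 1 divisor."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because each term is a product of two z-scores, r has no units and does not change if the weights are expressed in kilograms or the fuel consumption in liters.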

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30).]

The correlation is due to a third "lurking" variable: playing time.

Correlation r = 0.935

End of Chapter 3

Page 84: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

68-95-997 rule

Mean andStandard Deviation

(numerical)

Histogram(graphical)

68-95-997 rule

The 68-95-997 ruleIf the histogram of the data is

approximately bell-shaped then1) approximately of the measurements

are of the mean

that is in ( )

2) approximately of the measurement

68

within 1 standard deviation

95

within 2 standard deviation

s

are of the meas n

that is

y s y s

almost all

within 3 standard deviation

in ( 2 2 )

3) the measurements

are of the mean

that is in ( 3 3 )

s

y s y s

y s y s

68-95-997 rule 68 within 1 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

68

3434

y-s y y+s

68-95-997 rule 95 within 2 stan dev of the mean

0

005

01

015

02

025

03

035

04

045

95

475 475

y-2s y y+2s

Example textbook costs

37548

4272

50

y

s

n

286 291 307 308 315 316 327328 340 342 346 347 348 348 349354 355 355 360 361 364 367 369371 373 377 380 381 382 385 385387 390 390 397 398 409 409 410418 422 424 425 426 428 433 434437 440 480

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating QuartilesStep 1 find the median of all the data (the median divides the data in half)

Step 2a find the median of the lower half this median is Q1Step 2b find the median of the upper half this median is Q3

Importantwhen n is odd include the overall median in both halveswhen n is even do not include the overall median in either half

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)2 = 222 = 11

Q1 median of lower half 2 4 6 8 10

Q1 = 6

Q3 median of upper half 12 14 16 18 20

Q3 = 16

11

Pulse Rates n = 138

Stem Leaves4

3 4 5889 5 00123344410 5 555678889923 6 0001111112223333334444423 6 5555666666777778888888816 7 0000011222233444423 7 5555566666677788888899910 8 000011222410 8 55556677894 9 00122 9 584 10 0223

101 11 1

Median mean of pulses in locations 69 amp 70 median= (70+70)2=70

Q1 median of lower half (lower half = 69 smallest pulses) Q1 = pulse in ordered position 35Q1 = 63

Q3 median of upper half (upper half = 69 largest pulses) Q3= pulse in position 35 from the high end Q3=78

Below are the weights of 31 linemen on the NCSU football team What is the

value of the first quartile Q1

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 287

2 2575

3 2635

4 2625

Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot: display of 5-number summary

[Figure: boxplot]

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot: display of 5-number summary

Disease X: years until death for 25 individuals (plot axis: Years until death, 0 to 8)

Individuals 1 to 25: 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5(1.9) = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
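The outlier check on this slide can be sketched in a few lines of Python. This is an illustration only: the quartiles are taken from the slide rather than recomputed, and the variable names (`lower_fence`, `upper_fence`, `whisker_high`) are labels chosen for the sketch.

```python
# Disease X data as shown on this slide (largest value 7.9 years).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
q1, q3 = 2.3, 4.2  # quartiles as given on the slide

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]
# The whiskers stop at the most extreme values that are still inside the fences.
inside = [y for y in years if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f})")
print("outliers:", outliers)
print("whiskers end at", whisker_low, "and", whisker_high)
```

Only the upper fence matters for these data: the lower fence is negative, and no survival time can fall below it.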

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot labels: 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition: 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
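The marginal distributions can be reproduced from the cell counts alone. The sketch below is a minimal illustration using plain Python dictionaries; the names `counts`, `survival_marginal` and `class_marginal` are chosen for this example and are not from the slides.

```python
# Titanic counts from the contingency table (rows: survival, columns: class).
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: 100 * sum(row) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    name: 100 * sum(row[j] for row in counts.values()) / grand_total
    for j, name in enumerate(classes)
}

for label, pct in {**survival_marginal, **class_marginal}.items():
    print(f"{label}: {pct:.1f}%")
```

Each marginal distribution uses only the totals in the margins of the table, which is where the name comes from.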

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival            Crew  First  Second  Third  Total
Alive  Count         212    202     118    178    710
       % of col     24.0   62.2    41.4   25.2   32.3
Dead   Count         673    123     167    528   1491
       % of col     76.0   37.8    58.6   74.8   67.7
Total  Count         885    325     285    706   2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                        Class
Survival            Crew  First  Second  Third  Total
Alive  Count         212    202     118    178    710
       % of col     24.0   62.2    41.4   25.2   32.3
Dead   Count         673    123     167    528   1491
       % of col     76.0   37.8    58.6   74.8   67.7
Total  Count         885    325     285    706   2201
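The three questions share the numerator 202 and differ only in the denominator, that is, in which group is being conditioned on. A small sketch makes the distinction explicit; the variable names are illustrative and not from the slides.

```python
# Counts taken from the Titanic table on the slide.
first_class_survivors = 202
all_survivors = 710
all_first_class = 325
everyone = 2201

# Among survivors, the fraction who were in first class (condition on the row).
p_first_given_alive = first_class_survivors / all_survivors
# Among everyone aboard, the fraction who were first class AND survived.
p_first_and_alive = first_class_survivors / everyone
# Among first-class passengers, the fraction who survived (condition on the column).
p_alive_given_first = first_class_survivors / all_first_class

print(f"{p_first_given_alive:.3f} {p_first_and_alive:.3f} {p_alive_given_first:.3f}")
```

The last value matches the 62.2% entry in the "% of col" row for first class.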

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Scatterplot: Blood Alcohol Content vs Number of Beers, plotted from the Student / Beers / BAC table for the 16 students]

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

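Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$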
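The correlation for the fuel consumption data can be checked with a direct translation of the formula. The sketch below uses only the Python standard library; `correlation` is a hypothetical helper written for this illustration, and the ten data pairs are the ones listed on the scatterplot slide (weight in 1000 lbs, fuel consumption in gal/100 miles).

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles) from the slide
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weights, fuel)
print(round(r, 4))
```

Each term in the sum is the product of two z-scores, so points that are above average (or below average) on both variables push r toward +1.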

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

The 68-95-99.7 rule

If the histogram of the data is approximately bell-shaped, then

1) approximately 68% of the measurements are within 1 standard deviation of the mean, that is, in $(\bar{y} - s, \bar{y} + s)$

2) approximately 95% of the measurements are within 2 standard deviations of the mean, that is, in $(\bar{y} - 2s, \bar{y} + 2s)$

3) almost all the measurements are within 3 standard deviations of the mean, that is, in $(\bar{y} - 3s, \bar{y} + 3s)$

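68-95-99.7 rule: 68% within 1 stan. dev. of the mean

[Figure: bell-shaped curve, vertical axis 0 to 0.45; 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, 34% on each side of $\bar{y}$]

68-95-99.7 rule: 95% within 2 stan. dev. of the mean

[Figure: bell-shaped curve, vertical axis 0 to 0.45; 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, 47.5% on each side of $\bar{y}$]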

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont.)

$\bar{y} = 375.48$, $s = 42.72$, same 50 textbook costs

1 standard deviation interval about the mean: $(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$
percentage of data values in this interval: 32/50 = 64%; 68-95-99.7 rule: 68%

2 standard deviation interval about the mean: $(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$
percentage of data values in this interval: 48/50 = 96%; 68-95-99.7 rule: 95%

3 standard deviation interval about the mean: $(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$
percentage of data values in this interval: 50/50 = 100%; 68-95-99.7 rule: 99.7%
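The three interval counts for the textbook costs can be reproduced with a short script. This is a sketch under one stated assumption: the standard deviation s = 42.72 is taken from the slide rather than recomputed, and `share_within` is a hypothetical helper name.

```python
from statistics import mean

# The 50 textbook costs from the slide.
costs = [286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
         346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
         364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
         385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
         422, 424, 425, 426, 428, 433, 434, 437, 440, 480]
s = 42.72  # sample standard deviation as reported on the slide

y_bar = mean(costs)

def share_within(k):
    """Count and percentage of costs within k standard deviations of the mean."""
    low, high = y_bar - k * s, y_bar + k * s
    count = sum(low <= c <= high for c in costs)
    return count, 100 * count / len(costs)

for k in (1, 2, 3):
    count, pct = share_within(k)
    print(f"within {k} sd: {count} of {len(costs)} ({pct:.0f}%)")
```

The observed shares (64%, 96%, 100%) are close to, but not exactly, the 68%, 95% and 99.7% that the rule predicts for bell-shaped data.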

The best estimate of the standard deviation of the men's weights displayed in this dotplot is:

1) 10
2) 15
3) 20
4) 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation

z-score corresponding to y:

$$z = \frac{y - \bar{y}}{s}$$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.
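The exam comparison can be written as a tiny function. This is a minimal sketch; `z_score` is a hypothetical helper name, and the means, standard deviations and scores are the ones from the two-exam example.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - y_bar) / s

z_exam1 = z_score(91, 88, 6)    # exam 1: mean 88, standard deviation 6
z_exam2 = z_score(92, 88, 10)   # exam 2: mean 88, standard deviation 10

# The larger z-score is the better result relative to the rest of the class.
better = "exam 1" if z_exam1 > z_exam2 else "exam 2"
print(z_exam1, z_exam2, better)
```

Although 92 is the higher raw score, it sits fewer standard deviations above its class mean than the 91 does.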

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School      Support  y - ybar  Z-score
Maryland    15.5      6.4       1.79
UVA         13.1      4.0       1.12
Louisville  10.9      1.8       0.50
UNC          9.2      0.1       0.03
VaTech       7.9     -1.2      -0.34
FSU          7.9     -1.2      -0.34
GaTech       7.1     -2.0      -0.56
NCSU         6.5     -2.6      -0.73
Clemson      3.8     -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
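The claim that z-scores add to zero can be verified numerically for the ACC data. The sketch below recomputes the mean and sample standard deviation from the nine support figures; individual z-scores computed this way may differ from the table in the last printed digit because of rounding.

```python
from statistics import mean, stdev

# Support to athletic departments, 2013, in $ millions (from the table).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)
s = stdev(values)  # sample standard deviation (divides by n - 1)

z_scores = {school: (y - y_bar) / s for school, y in support.items()}

print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(y - y_bar for y in values):.10f}")
print(f"sum of z-scores   = {sum(z_scores.values()):.10f}")
```

The deviations from the mean always sum to zero, and dividing every deviation by the same s does not change that, so the z-scores sum to zero as well (up to floating-point rounding).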

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185, with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03
2) -1.03
3) 2.39
4) 1865
5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

Quartiles: Measuring spread by examining the middle

Disease X: years until death, individuals 1 to 25: 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Q1 = first quartile = 2.3

m = median = 3.4

Q3 = third quartile = 4.2

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Page 86: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

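68-95-99.7 rule: 68% within 1 standard deviation of the mean

[Figure: bell-shaped curve; the central 68% of the area lies between $\bar{y} - s$ and $\bar{y} + s$, with 34% on each side of the mean $\bar{y}$]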

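68-95-99.7 rule: 95% within 2 standard deviations of the mean

[Figure: bell-shaped curve; the central 95% of the area lies between $\bar{y} - 2s$ and $\bar{y} + 2s$, with 47.5% on each side of the mean $\bar{y}$]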

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

Example: textbook costs (cont)

$\bar{y} = 375.48$, $s = 42.72$

1 standard deviation interval about the mean:
$(\bar{y} - s, \bar{y} + s) = (332.76, 418.20)$

Percentage of data values in this interval: 32/50 = 64%
68-95-99.7 rule: 68%

Example: textbook costs (cont)

2 standard deviation interval about the mean:
$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

Percentage of data values in this interval: 48/50 = 96%
68-95-99.7 rule: 95%

Example: textbook costs (cont)

3 standard deviation interval about the mean:
$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

Percentage of data values in this interval: 50/50 = 100%
68-95-99.7 rule: 99.7%
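The interval counts in the textbook-cost example can be checked with a few lines of Python. This is only a sketch: the function name `percent_within` is an illustrative choice, and the mean and standard deviation are taken as quoted on the slide rather than recomputed.

```python
# Textbook costs from the example (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

Y_BAR = 375.48  # sample mean as quoted on the slide
S = 42.72       # sample standard deviation as quoted on the slide


def percent_within(data, center, spread, k):
    """Return (count, percent) of values strictly inside center +/- k*spread."""
    low, high = center - k * spread, center + k * spread
    inside = sum(low < value < high for value in data)
    return inside, 100 * inside / len(data)


if __name__ == "__main__":
    for k in (1, 2, 3):
        count, pct = percent_within(costs, Y_BAR, S, k)
        print(f"within {k} sd: {count} of {len(costs)} values ({pct:.0f}%)")
```

The observed percentages can then be compared with the 68%, 95%, and 99.7% that the rule predicts for approximately bell-shaped data.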

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont): Using the Mean and Standard Deviation Together

Preceding slides: 68-95-99.7 rule (also called the Empirical Rule)

Next: z-scores

Z-scores: Standardized Data Values

A z-score measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \frac{y - \bar{y}}{s}$

where
$y$ = the original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

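Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

The 91 on exam 1 is better than the 92 on exam 2.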

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680 (SAT mean = 500, sd = 100)

ACT Math: Gerald's score 27 (ACT mean = 18, sd = 6)

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
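The z-score comparisons above reduce to one small function. The sketch below assumes a helper named `z_score` (an illustrative name, not from the slides) and reuses the exam and SAT/ACT numbers quoted in the examples.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) `mean`."""
    return (value - mean) / sd


if __name__ == "__main__":
    # Exam example: same mean, different spreads.
    print("exam 1:", z_score(91, 88, 6))
    print("exam 2:", z_score(92, 88, 10))

    # SAT vs ACT example: different scales made comparable.
    print("Eleanor (SAT):", z_score(680, 500, 100))
    print("Gerald (ACT):", z_score(27, 18, 6))
```

Because a z-score is unit-free, the larger z-score identifies the better relative standing even when the raw scores are on different scales.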

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47
                       Sum = 0    Sum = 0

Mean = 9.1000, s = 3.5697
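The claim that deviations and z-scores add to zero can be checked numerically. A minimal sketch using the support figures from the table and Python's standard `statistics` module; the variable names are illustrative, and small floating-point residue is expected instead of an exact zero.

```python
from statistics import mean, stdev

# Support in $ millions, in the order listed in the table.
support = [15.5, 13.1, 10.9, 9.2, 7.9, 7.9, 7.1, 6.5, 3.8]

y_bar = mean(support)   # sample mean
s = stdev(support)      # sample standard deviation (divides by n - 1)

deviations = [y - y_bar for y in support]
z_scores = [d / s for d in deviations]

if __name__ == "__main__":
    print(f"mean = {y_bar:.4f}, s = {s:.4f}")
    print(f"sum of deviations = {sum(deviations):.10f}")
    print(f"sum of z-scores   = {sum(z_scores):.10f}")
```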

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

[Figure: sorted data split at Q1, M, and Q3 into four pieces, each holding 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
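The quartile rules stated in these slides can be sketched in Python as follows. The function name `quartiles` is illustrative; the code follows the convention given here (when n is odd the overall median goes into both halves), which may differ from the default quartile method in spreadsheet or statistics software.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slides."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: include the overall median in both halves.
        lower = data[: half + 1]
        upper = data[half:]
    else:
        # n even: the overall median is not a data value, so split cleanly.
        lower = data[:half]
        upper = data[half:]
    return median(lower), median(data), median(upper)


if __name__ == "__main__":
    q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
    print(f"Q1 = {q1}, median = {m}, Q3 = {q3}, IQR = {q3 - q1}")
```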

Pulse Rates, n = 138

       Stem  Leaves
         4
  3      4   588
  9      5   001233444
 10      5   5556788899
 23      6   00011111122233333344444
 23      6   55556666667777788888888
 16      7   00000112222334444
 23      7   55555666666777888888999
 10      8   0000112224
 10      8   5555667789
  4      9   0012
  2      9   58
  4     10   0223
        10
  1     11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR):

IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  | 7
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
 10    30  | 333
  7    31  | 45
  5    32  | 155
  2    33  | 6
  1    34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

[Figure: boxplot for Disease X; vertical axis "Years until death", 0 to 7]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot display of 5-number summary

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: boxplot for Disease X; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates marked at 40.5, 45, 63, 70, 78, and 100.5]
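The 1.5 × IQR outlier check used in these boxplot slides can be sketched as below. The names `fences` and `outliers` are illustrative, and the multiplier 1.5 is exposed as a parameter `k` purely for convenience.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(data, q1, q3, k=1.5):
    """Return the data values that fall outside the fences."""
    low, high = fences(q1, q3, k)
    return [value for value in data if value < low or value > high]


if __name__ == "__main__":
    # Beginning-of-class pulse rates: Q1 = 63, Q3 = 78.
    print("fences:", fences(63, 78))
    # Applied to the five-number summary of the pulse data.
    print("outside the fences:", outliers([45, 63, 70, 78, 111], 63, 78))
```

A boxplot whisker is then drawn only as far as the most extreme data value that is still inside the fences.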

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, the height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., the height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                              Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
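The marginal, conditional, and joint proportions from the Titanic table can be reproduced with plain Python. This is a sketch only: the dictionary layout and variable names are illustrative choices, not part of the original slides.

```python
# Counts from the Titanic contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {status: row_totals[status] / grand_total for status in counts}
marginal_class = {c: col_totals[c] / grand_total for c in classes}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions about first class, each with a different denominator.
first_given_alive = counts["Alive"]["First"] / row_totals["Alive"]     # 202/710
first_and_alive = counts["Alive"]["First"] / grand_total               # 202/2201
alive_given_first = counts["Alive"]["First"] / col_totals["First"]     # 202/325

if __name__ == "__main__":
    for c in classes:
        print(f"{c}: {100 * marginal_class[c]:.1f}% of people, "
              f"{100 * survival_given_class[c]:.1f}% survived")
```

The three first-class questions share the numerator 202 and differ only in the denominator, which is what separates a conditional proportion from a joint one.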

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Previous slides: Contingency Tables for Bivariate Categorical Data

Next: Scatterplots and Correlation for Bivariate Quantitative Data

Student   Beers   Blood Alcohol
  1         5       0.1
  2         2       0.03
  3         9       0.19
  4         7       0.095
  5         3       0.07
  6         3       0.02
  7         4       0.07
  8         5       0.085
  9         8       0.12
 10         3       0.04
 11         5       0.06
 12         5       0.05
 13         6       0.1
 14         7       0.09
 15         1       0.01
 16         4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Student   Beers   BAC
  1         5     0.1
  2         2     0.03
  3         9     0.19
  4         7     0.095
  5         3     0.07
  6         3     0.02
  7         4     0.07
  8         5     0.085
  9         8     0.12
 10         3     0.04
 11         5     0.06
 12         5     0.05
 13         6     0.1
 14         7     0.09
 15         1     0.01
 16         4     0.05

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$

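Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$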
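The correlation formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. A minimal sketch using Python's standard `statistics` module; the function name `correlation` is an illustrative choice.

```python
from statistics import mean, stdev

# (x, y) pairs from the fuel-consumption example:
# x = car weight (1000 lbs), y = fuel consumption (gal/100 miles).
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


if __name__ == "__main__":
    print(f"r = {correlation(weights, fuel):.4f}")
```

Because each factor is a z-score, r has no units and does not change if weight is recorded in kilograms or fuel consumption in liters.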

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

Example: textbook costs

$\bar{y} = 375.48$, $s = 42.72$, $n = 50$

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

Example: textbook costs (cont.)

1 standard deviation interval about the mean:
$(\bar{y} - s,\ \bar{y} + s) = (332.76,\ 418.20)$
Percentage of data values in this interval: 32/50 = 64%
68-95-99.7 rule: 68%

Example: textbook costs (cont.)

2 standard deviation interval about the mean:
$(\bar{y} - 2s,\ \bar{y} + 2s) = (290.04,\ 460.92)$
Percentage of data values in this interval: 48/50 = 96%
68-95-99.7 rule: 95%

Example: textbook costs (cont.)

3 standard deviation interval about the mean:
$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$
Percentage of data values in this interval: 50/50 = 100%
68-95-99.7 rule: 99.7%
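The three interval counts above can be checked with a short script. This is a minimal sketch, not part of the original slides: the helper name `share_within` is made up for illustration, and the mean and standard deviation are simply taken as the values reported on the slide rather than recomputed.

```python
# Textbook costs from the slides (n = 50).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328,
    340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371,
    373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]

Y_BAR = 375.48  # sample mean reported on the slide
S = 42.72       # sample standard deviation reported on the slide


def share_within(data, center, spread, k):
    """Fraction of values inside (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    return sum(low < v < high for v in data) / len(data)


# Compare the observed shares with the 68-95-99.7 rule.
for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    observed = share_within(costs, Y_BAR, S, k)
    print(f"within {k} s of the mean: {observed:.0%} (rule says about {rule})")
```

The observed shares need not match the rule exactly; the rule is an approximation for roughly bell-shaped histograms.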

The best estimate of the standard deviation of the men's weights displayed in this dotplot is:

1. 10
2. 15
3. 20
4. 40

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where
$y$ = original data value
$\bar{y}$ = the sample mean
$s$ = the sample standard deviation
$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.
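The two comparisons above (exam 1 vs exam 2, SAT vs ACT) use the same calculation, which can be sketched as a single function. The function name `z_score` is an illustrative choice, not something from the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd


# Exam example: same mean, different spreads.
exam1 = z_score(91, mean=88, sd=6)
exam2 = z_score(92, mean=88, sd=10)

# SAT vs ACT example: different scales made comparable.
eleanor = z_score(680, mean=500, sd=100)
gerald = z_score(27, mean=18, sd=6)

print(f"exam 1: z = {exam1:.1f}, exam 2: z = {exam2:.1f}")
print(f"Eleanor: z = {eleanor:.1f}, Gerald: z = {gerald:.1f}")
```

The larger z-score marks the better relative standing, which is why 91 on exam 1 beats 92 on exam 2 even though the raw score is lower.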

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland      15.5       6.4       1.79
UVA           13.1       4.0       1.12
Louisville    10.9       1.8       0.50
UNC            9.2       0.1       0.03
VaTech         7.9      -1.2      -0.34
FSU            7.9      -1.2      -0.34
GaTech         7.1      -2.0      -0.56
NCSU           6.5      -2.6      -0.73
Clemson        3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
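The claim that deviations and z-scores sum to zero can be checked numerically. The sketch below assumes the support figures exactly as printed in the table (rounded to one decimal), so an individual z-score may differ from the slide in the last digit; the variable names are illustrative.

```python
from statistics import mean, stdev

# Support in $ millions, as printed in the table above.
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)   # sample mean
s = stdev(values)      # sample standard deviation (divides by n - 1)

deviations = {school: y - y_bar for school, y in support.items()}
z_scores = {school: d / s for school, d in deviations.items()}

print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(deviations.values()):.10f}")
print(f"sum of z-scores   = {sum(z_scores.values()):.10f}")
```

Both sums come out as zero up to floating-point rounding, because the deviations from the mean always cancel and dividing each by the same s does not change that.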

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

       Q1        M        Q3
  1/4  |   1/4   |   1/4   |  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
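The quartile rule above can be sketched in a few lines. This follows the slide's convention (overall median included in both halves when n is odd, in neither when n is even); software packages often use other conventions and may give slightly different quartiles. The function name `quartiles` is an illustrative choice.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the include-the-median-when-n-is-odd rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower, upper = values[: half + 1], values[half:]
    else:
        # n even: the halves split cleanly; no value is shared.
        lower, upper = values[:half], values[half:]
    return median(lower), median(values), median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)  # Q1 = 6, median = 11, Q3 = 16, matching the example
```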

Pulse Rates, n = 138

Count  Stem | Leaves
         4  |
   3     4  | 588
   9     5  | 001233444
  10     5  | 5556788899
  23     6  | 00011111122233333344444
  23     6  | 55556666667777788888888
  16     7  | 00000112222334444
  23     7  | 55555666666777888888999
  10     8  | 0000112224
  10     8  | 5555667789
   4     9  | 0012
   2     9  | 58
   4    10  | 0223
        10  |
   1    11  | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Boxplot display of 5-number summary

[Figure: boxplot of Disease X, years until death (axis 0 to 7), built from the five-number summary: min, Q1, m, Q3, max]
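The five-number summary shown for the Disease X data can be reproduced with a short sketch. It uses the quartile rule from the earlier slide (overall median included in both halves when n is odd); the function name `five_number_summary` is an illustrative choice, and the values are listed in the order they appear in the table.

```python
from statistics import median

# Years until death for the 25 Disease X individuals, as listed in the table.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[: half + n % 2]  # includes the median only when n is odd
    upper = values[half:]           # starts at the median when n is odd
    return values[0], median(lower), median(values), median(upper), values[-1]


print(five_number_summary(years))
```

These five numbers are exactly what a basic boxplot draws: the box spans Q1 to Q3, a line marks the median, and the whiskers reach the minimum and maximum.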

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Largest = max = 7.9

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Boxplot display of 5-number summary

[Figure: boxplot of Disease X, years until death (axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with marked values 40.5, 45, 63, 70, 78, and 100.5]
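The 1.5 × IQR rule used on the pulse-rate and Disease X slides can be sketched as follows. The function names `fences` and `split_outliers` are illustrative, and the Disease X list is the modified data set whose largest value is 7.9.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(data, q1, q3):
    """Separate values inside the fences from values flagged as outliers."""
    low, high = fences(q1, q3)
    inside = [v for v in data if low <= v <= high]
    outside = [v for v in data if v < low or v > high]
    return inside, outside


# Pulse rates: Q1 = 63 and Q3 = 78 give fences of 40.5 and 100.5.
print(fences(63, 78))

# Disease X data with the largest value changed to 7.9 (Q1 = 2.3, Q3 = 4.2).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
inside, outliers = split_outliers(years, 2.3, 4.2)
print("outliers:", outliers)
print("upper whisker ends at:", max(inside))
```

The upper whisker stops at the largest value that is still inside the fence, and anything beyond a fence is plotted as an individual point.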

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                               Class
Survival            Crew    First   Second   Third   Total
Alive   Count        212      202      118     178     710
        % of col    24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count        673      123      167     528    1491
        % of col    76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count        885      325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the table above):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
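The marginal and conditional distributions for the Titanic table can be computed directly from the cell counts. The sketch below is one way to organize the calculation; the dictionary layout and variable names are illustrative choices.

```python
# Cell counts from the Titanic contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distributions: row or column totals divided by the grand total.
marginal_survival = {s: t / grand_total for s, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distribution of survival given class: divide by the column total.
survival_given_class = {c: counts["Alive"][c] / col_totals[c] for c in classes}

# The three questions use the same cell (202) with three different denominators.
first_among_survivors = counts["Alive"]["First"] / row_totals["Alive"]   # 202/710
first_and_survived = counts["Alive"]["First"] / grand_total              # 202/2201
survived_among_first = counts["Alive"]["First"] / col_totals["First"]    # 202/325

for c in classes:
    print(f"{c}: {marginal_class[c]:.1%} of all aboard, "
          f"{survival_given_class[c]:.1%} survived")
```

The point of the three questions is that the numerator stays fixed while the denominator changes with the group being conditioned on.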

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol (BAC)
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:
1) how many beers they drank, and
2) their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$

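Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$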
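The formula for r can be applied to the ten (weight, fuel consumption) pairs from the scatterplot slide. The sketch below computes r as the average product of standardized values, dividing by n - 1; the function name `correlation` is an illustrative choice.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in kilograms or fuel consumption in liters.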

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( ) (33276 41820)

32percentage of data values in this interval 64

5068-95-997 rule 68

y s

y s y s

1 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 2 2 ) (29004 46092)

48percentage of data values in this interval 96

5068-95-997 rule 95

y s

y s y s

2 standard deviation interval about the mean

Example textbook costs (cont)286 291 307 308 315 316 327 328340 342 346 347 348 348 349 354355 355 360 361 364 367 369 371373 377 380 381 382 385 385 387390 390 397 398 409 409 410 418422 424 425 426 428 433 434 437440 480

37548 4272

( 3 3 ) (24732 50364)

50percentage of data values in this interval 100

5068-95-997 rule 997

y s

y s y s

3 standard deviation interval about the mean

The best estimate of the standard deviation of the menrsquos weights

displayed in this dotplot is

1 10

2 15

3 20

4 40

Section 33 (cont)Using the Mean and Standard

Deviation Together68-95-997 rule

(also called the Empirical Rule)

z-scores

Preceding slides Next

Z-scores Standardized Data Values

Measures the distance of a number from the mean in units of

the standard deviation

z-score corresponding to y

where

original data value

the sample mean

s the sample standard deviation

the z-score corresponding to

y yz

s

y

y

z y

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
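The quartile rule above (median of each half, with the overall median included in both halves when n is odd) can be sketched as follows. The function name `quartiles` is hypothetical, and this is one convention among several; statistical software often uses other quartile definitions and may return slightly different values.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    if n % 2 == 1:
        lower = values[:half + 1]   # includes the middle value
        upper = values[half:]       # includes the middle value
    else:
        lower = values[:half]
        upper = values[half:]
    return statistics.median(lower), statistics.median(values), statistics.median(upper)


q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For the ten-value example this reproduces Q1 = 6, median = 11, and Q3 = 16.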

Pulse Rates (n = 138)

Count   Stem | Leaves
          4  |
  3       4  | 588
  9       5  | 001233444
 10       5  | 5556788899
 23       6  | 00011111122233333344444
 23       6  | 55556666667777788888888
 16       7  | 0000011222233444
 23       7  | 55555666666777888888999
 10       8  | 0000112224
 10       8  | 5555667789
  4       9  | 0012
  2       9  | 58
  4      10  | 0223
         10  |
  1      11  | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   Stem | Leaf
  2      22  | 55
  4      23  | 57
  6      24  | 26
  7      25  | 7
 10      26  | 257
 12      27  | 59
 (4)     28  | 1567
 15      29  | 35599
 10      30  | 333
  7      31  | 45
  5      32  | 155
  2      33  | 6
  1      34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR):

IQR = Q3 - Q1

measures spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   Stem | Leaf
  2      22  | 55
  4      23  | 57
  6      24  | 26
  7      25  | 7
 10      26  | 257
 12      27  | 59
 (4)     28  | 1567
 15      29  | 35599
 10      30  | 333
  7      31  | 45
  5      32  | 155
  2      33  | 6
  1      34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Obs.   Count   Value
 25      1      6.1
 24      2      5.6
 23      3      5.3
 22      4      4.9
 21      5      4.7
 20      6      4.5
 19      7      4.2
 18      6      4.1
 17      5      3.9
 16      4      3.8
 15      3      3.7
 14      2      3.6
 13      1      3.4
 12      2      3.3
 11      3      2.9
 10      4      2.8
  9      5      2.5
  8      6      2.3
  7      7      2.3
  6      6      2.1
  5      5      1.5
  4      4      1.9
  3      3      1.6
  2      2      1.2
  1      1      0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Obs.   Count   Value
 25      1      7.9
 24      2      6.1
 23      3      5.3
 22      4      4.9
 21      5      4.7
 20      6      4.5
 19      7      4.2
 18      6      4.1
 17      5      3.9
 16      4      3.8
 15      3      3.7
 14      2      3.6
 13      1      3.4
 12      2      3.3
 11      3      2.9
 10      4      2.8
  9      5      2.5
  8      6      2.3
  7      7      2.3
  6      6      2.1
  5      5      1.5
  4      4      1.9
  3      3      1.6
  2      2      1.2
  1      1      0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.

ATM Withdrawals by Day Month Holidays

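Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, marking 45, Q1 = 63, median = 70, Q3 = 78, and the fences 40.5 and 100.5]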
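The 1.5 × IQR outlier check used for the boxplots above can be sketched as a pair of small helpers. The names `outlier_fences` and `find_outliers` are illustrative, and the quartiles are passed in rather than computed so that the slide's values can be used directly.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return [y for y in data if y < low or y > high]


# Pulse rates from the slide: Q1 = 63, Q3 = 78.
low, high = outlier_fences(63, 78)
print(low, high)

# A few pulse values from the stem-and-leaf display, including both extremes.
print(find_outliers([45, 100, 102, 111], 63, 78))
```

Whiskers are then drawn to the most extreme data values that still lie inside the fences, and anything beyond a fence is plotted individually.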

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
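The marginal distributions above are just row totals and column totals divided by the grand total. A minimal sketch, assuming the Titanic counts are stored as a nested dictionary (the layout and the function name `marginal_distributions` are illustrative):

```python
# Counts from the Titanic table: rows are survival, columns are class.
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def marginal_distributions(table):
    """Return (row marginals, column marginals) as proportions of the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    row_marginal = {r: sum(row.values()) / grand_total for r, row in table.items()}
    columns = list(next(iter(table.values())).keys())
    col_marginal = {
        c: sum(row[c] for row in table.values()) / grand_total for c in columns
    }
    return row_marginal, col_marginal


by_survival, by_class = marginal_distributions(titanic)
for name, p in {**by_survival, **by_class}.items():
    print(f"{name}: {100 * p:.1f}%")
```

Each marginal distribution ignores the other variable entirely, which is why it can be read off the "Total" row or column of the table.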

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                           Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
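The "% of col" rows above are conditional distributions: each count is divided by its own column total instead of the grand total. A short sketch under the same assumed dictionary layout (the name `survival_given_class` is hypothetical):

```python
# Counts from the Titanic table: rows are survival, columns are class.
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def survival_given_class(table, ship_class):
    """Conditional distribution of survival within one class (one column)."""
    column_total = sum(row[ship_class] for row in table.values())
    return {outcome: row[ship_class] / column_total for outcome, row in table.items()}


for ship_class in ["Crew", "First", "Second", "Third"]:
    dist = survival_given_class(titanic, ship_class)
    print(ship_class, {k: f"{100 * v:.1f}%" for k, v in dist.items()})
```

Comparing these column percentages across classes is what reveals the association between class and survival; the marginal distributions alone cannot show it.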

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                           Class
Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "Fuel Consumption vs Car Weight"; x-axis: weight (1000 lbs), 1.5 to 4.5; y-axis: fuel consumption (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

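Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "Fuel Consumption vs Car Weight"; x-axis: weight (1000 lbs), 1.5 to 4.5; y-axis: fuel consumption (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$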
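The correlation for the fuel consumption data can be recomputed directly from the definition of r as the sum of products of standardized values divided by n - 1. The sketch below assumes the data pairs are read with decimal points restored (3.4 thousand lbs, 5.5 gal/100 miles, and so on); the function name `correlation` is illustrative.

```python
import statistics


def correlation(xs, ys):
    """Sample correlation r = (1/(n-1)) * sum of z_x * z_y over all pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample sd (n - 1)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))
```

Because r is built from z-scores, it has no units and does not change if weight is recorded in pounds instead of thousands of pounds.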

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of points (0 to 80) vs fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Example: textbook costs (cont.)

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

$\bar{y} = 375.48$, $s = 42.72$

2 standard deviation interval about the mean:

$(\bar{y} - 2s, \bar{y} + 2s) = (290.04, 460.92)$

Percentage of data values in this interval: 48/50 = 96%

68-95-99.7 rule: 95%

Example: textbook costs (cont.)

286 291 307 308 315 316 327 328
340 342 346 347 348 348 349 354
355 355 360 361 364 367 369 371
373 377 380 381 382 385 385 387
390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437
440 480

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean:

$(\bar{y} - 3s, \bar{y} + 3s) = (247.32, 503.64)$

Percentage of data values in this interval: 50/50 = 100%

68-95-99.7 rule: 99.7%
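The interval counts in the textbook cost example can be reproduced in a few lines. This is a sketch only: the helper name `count_within` is hypothetical, and the mean and standard deviation are recomputed from the 50 listed costs rather than taken from the slide.

```python
import statistics

# The 50 textbook costs listed on the slide.
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342, 346, 347, 348, 348, 349, 354,
    355, 355, 360, 361, 364, 367, 369, 371, 373, 377, 380, 381, 382, 385, 385, 387,
    390, 390, 397, 398, 409, 409, 410, 418, 422, 424, 425, 426, 428, 433, 434, 437,
    440, 480,
]


def count_within(data, k):
    """Count the values inside (mean - k*s, mean + k*s) and return (count, share)."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation
    low, high = mean - k * s, mean + k * s
    inside = sum(1 for y in data if low <= y <= high)
    return inside, inside / len(data)


print(round(statistics.mean(costs), 2), round(statistics.stdev(costs), 2))
for k in (2, 3):
    print(k, count_within(costs, k))
```

The observed shares (96% and 100%) sit close to the 95% and 99.7% that the rule predicts for roughly bell-shaped data.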

The best estimate of the standard deviation of the men's weights displayed in this dotplot is:

1. 10

2. 15

3. 20

4. 40

  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 91: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Example: textbook costs (cont.)

286 291 307 308 315 316 327 328 340 342
346 347 348 348 349 354 355 355 360 361
364 367 369 371 373 377 380 381 382 385
385 387 390 390 397 398 409 409 410 418
422 424 425 426 428 433 434 437 440 480

$\bar{y} = 375.48$, $s = 42.72$

3 standard deviation interval about the mean:

$(\bar{y} - 3s,\ \bar{y} + 3s) = (247.32,\ 503.64)$

Percentage of data values in this interval: $\frac{50}{50} = 100\%$ (68-95-99.7 rule: 99.7%)
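The 3 standard deviation check can be reproduced in a few lines. This is a minimal sketch that assumes the 50 textbook costs read as listed in the example; `share_within` is an illustrative helper name, not something from the slides.

```python
import statistics

# Textbook costs from the example (50 values, assumed transcribed correctly).
costs = [
    286, 291, 307, 308, 315, 316, 327, 328, 340, 342,
    346, 347, 348, 348, 349, 354, 355, 355, 360, 361,
    364, 367, 369, 371, 373, 377, 380, 381, 382, 385,
    385, 387, 390, 390, 397, 398, 409, 409, 410, 418,
    422, 424, 425, 426, 428, 433, 434, 437, 440, 480,
]

def share_within(data, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    low, high = mean - k * s, mean + k * s
    return sum(1 for y in data if low <= y <= high) / len(data)

# Compare the observed shares with the 68-95-99.7 rule.
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(costs, k):.0%}")
```

The rule is only an approximation for roughly bell-shaped data, so the observed shares need not match 68, 95 and 99.7 exactly.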

The best estimate of the standard deviation of the men's weights displayed in this dotplot is

1. 10

2. 15

3. 20

4. 40

Section 3.3 (cont.): Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

Measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \dfrac{y - \bar{y}}{s}$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \dfrac{91 - 88}{6} = \dfrac{3}{6} = 0.5$

$z_2 = \dfrac{92 - 88}{10} = \dfrac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.
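The exam comparison is a direct application of the z-score definition. A minimal sketch follows; the function name `z_score` and its parameter names are illustrative choices.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

exam1 = z_score(91, mean=88, sd=6)   # exam 1: class mean 88, sd 6
exam2 = z_score(92, mean=88, sd=10)  # exam 2: class mean 88, sd 10

print(exam1, exam2)
print("exam 1" if exam1 > exam2 else "exam 2", "is the better score relative to the class")
```

The raw score on exam 2 is higher, but the exam 1 score sits further above its class mean once the spread of each exam is taken into account.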

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865

Section 3.4: Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4  1/4  1/4  1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
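The quartile rules can be written as a short function. This is a sketch of the median-of-halves rule as stated in the slides (overall median included in both halves when n is odd); the function name `quartiles` is an illustrative choice.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = ys[: half + 1], ys[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ys[:half], ys[half:]
    return statistics.median(lower), statistics.median(ys), statistics.median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

Other software may use a different quartile convention (several are in common use), so results from a calculator or spreadsheet can differ slightly from this rule.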

Pulse Rates, n = 138

Count  Stem | Leaves
          4 |
    3     4 | 588
    9     5 | 001233444
   10     5 | 5556788899
   23     6 | 00011111122233333344444
   23     6 | 55556666667777788888888
   16     7 | 00000112222334444
   23     7 | 55555666666777888888999
   10     8 | 0000112224
   10     8 | 5555667789
    4     9 | 0012
    2     9 | 58
    4    10 | 0223
         10 |
    1    11 | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
   2     22 | 55
   4     23 | 57
   6     24 | 26
   7     25 | 7
  10     26 | 257
  12     27 | 59
  (4)    28 | 1567
  15     29 | 35599
  10     30 | 333
   7     31 | 45
   5     32 | 155
   2     33 | 6
   1     34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: Disease X, years until death (axis 0 to 7), marked with the five-number summary: min, Q1, m, Q3, max]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Largest = max = 7.9

[Figure: boxplot display of the 5-number summary; Disease X, years until death (axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
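The outlier rule above (fences at 1.5 × IQR beyond the quartiles) can be sketched as follows. The data are the 25 Disease X values with decimal points assumed restored (0.6 up to 7.9), and `boxplot_stats` is a hypothetical helper name, not a library function.

```python
import statistics

# Years until death for the 25 individuals (decimal points assumed).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

def boxplot_stats(data):
    """Quartiles by the median-of-halves rule, plus the 1.5 x IQR fences."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    # When n is odd, the overall median belongs to both halves.
    lower = ys[: half + 1] if n % 2 else ys[:half]
    upper = ys[half:]
    q1, q3 = statistics.median(lower), statistics.median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [y for y in ys if low_fence <= y <= high_fence]
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "low_fence": low_fence, "high_fence": high_fence,
        # Whiskers end at the most extreme values that are not outliers.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [y for y in ys if y < low_fence or y > high_fence],
    }

summary = boxplot_stats(years)
print(summary)
```

With these values the upper whisker stops at the largest observation below the upper fence, and only the single largest value is flagged as an outlier.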

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis from 0 to 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival: 710/2201 = 32.3%, 1491/2201 = 67.7%

marg. dist. of class: 885/2201 = 40.2%, 325/2201 = 14.8%, 285/2201 = 12.9%, 706/2201 = 32.1%
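The marginal distributions come from the row and column totals divided by the grand total. A minimal sketch with the Titanic counts from the table; the dictionary layout and variable names are illustrative.

```python
# Counts from the Titanic contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals / grand total.
survival_marginal = {s: sum(row.values()) / total for s, row in counts.items()}

# Marginal distribution of class: column totals / grand total.
class_marginal = {c: sum(counts[s][c] for s in counts) / total for c in classes}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{share:6.1%}")
```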

Marginal distribution of class: bar chart

Marginal distribution of class: pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                                Class
Survival               Crew    First   Second   Third   Total
Alive    Count          212      202      118     178     710
         % of col.    24.0%    62.2%    41.4%   25.2%   32.3%
Dead     Count          673      123      167     528    1491
         % of col.    76.0%    37.8%    58.6%   74.8%   67.7%
Total    Count          885      325      285     706    2201
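The "% of col." rows are conditional distributions: each count is divided by its own column total rather than by the grand total. A short sketch, with illustrative variable names:

```python
# Titanic counts by class (columns of the contingency table).
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Conditional distribution of survival given class: count / column total.
survival_given_class = {c: alive[c] / (alive[c] + dead[c]) for c in alive}

for c, share in survival_given_class.items():
    print(f"{c:<7} survived: {share:5.1%}   died: {1 - share:5.1%}")
```

Comparing these conditional percentages across classes is what shows that survival depended on class; the marginal 32.3% alone cannot show that.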

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                                Class
Survival               Crew    First   Second   Third   Total
Alive    Count          212      202      118     178     710
         % of col.    24.0%    62.2%    41.4%   25.2%   32.3%
Dead     Count          673      123      167     528    1491
         % of col.    76.0%    37.8%    58.6%   74.8%   67.7%
Total    Count          885      325      285     706    2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1 452

2 488

3 268

4 277

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

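Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$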
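The correlation formula is an average (with divisor n - 1) of products of z-scores. The sketch below applies it to the car weight and fuel consumption pairs, assuming they read as (3.4, 5.5), (3.8, 5.9), ... with decimal points restored; the function name `correlation` is an illustrative choice.

```python
import statistics

# Car weight (1000 lbs) and fuel consumption (gal/100 miles); decimals assumed.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(xs, ys):
    """Correlation coefficient r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample sds
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Under the assumed data, the result should agree with the r = .9766 reported on the slide. Swapping the roles of x and y leaves r unchanged, since the formula treats the two variables symmetrically.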

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3

Section 3.3 (cont.) Using the Mean and Standard Deviation Together

68-95-99.7 rule (also called the Empirical Rule): preceding slides

z-scores: next

Z-scores: Standardized Data Values

A z-score measures the distance of a number from the mean in units of the standard deviation.

z-score corresponding to y:

$z = \frac{y - \bar{y}}{s}$

where

y = the original data value

$\bar{y}$ = the sample mean

s = the sample standard deviation

z = the z-score corresponding to y

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$

$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$

91 on exam 1 is better than 92 on exam 2.
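The exam comparison can be checked with a few lines of Python. This is only a minimal sketch: the helper name `z_score` and its parameter names are illustrative choices, not part of the slides.

```python
def z_score(y, mean, sd):
    """Return how many standard deviations y lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (y - mean) / sd


# Exam figures from the example: both exams have mean 88, but different spreads.
z_exam1 = z_score(91, mean=88, sd=6)
z_exam2 = z_score(92, mean=88, sd=10)

# The larger z-score is the better score relative to its own class.
better = "exam 1" if z_exam1 > z_exam2 else "exam 2"
print(f"z1 = {z_exam1:.1f}, z2 = {z_exam2:.1f}, better: {better}")
```

Because each z-score is measured in standard-deviation units, the two exams can be compared directly even though their spreads differ.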

If data has mean $\bar{y}$ and standard deviation s, then standardizing a particular value y indicates how many standard deviations y is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
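The table can be reproduced with Python's standard `statistics` module. The sketch below assumes the support figures are in $ millions with one decimal place, as the mean of 9.1 implies; the variable names are illustrative. Because it works from the rounded inputs shown on the slide, a printed z-score may differ from the table in the last digit.

```python
from statistics import mean, stdev

# Support in $ millions for the 9 public ACC schools (2013).
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
ybar = mean(values)   # sample mean
s = stdev(values)     # sample standard deviation (divides by n - 1)

deviations = {school: y - ybar for school, y in support.items()}
z_scores = {school: d / s for school, d in deviations.items()}

for school in support:
    print(f"{school:<10} {support[school]:5.1f} "
          f"{deviations[school]:6.1f} {z_scores[school]:6.2f}")

# Both sums are zero up to floating-point rounding.
sum_dev = sum(deviations.values())
sum_z = sum(z_scores.values())
print(f"mean = {ybar:.4f}, s = {s:.4f}")
print(f"|sum of deviations| < 1e-9: {abs(sum_dev) < 1e-9}")
print(f"|sum of z-scores|   < 1e-9: {abs(sum_z) < 1e-9}")
```

The z-scores sum to zero because the deviations from the mean always sum to zero, and dividing every deviation by the same s does not change that.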

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03
2. -1.03
3. 2.39
4. 1865
5. -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range: Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Data values by position, 1 to 25 (position 7 = Q1, position 13 = median, position 19 = Q3):

0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1, M, and Q3 split the sorted data into four pieces, each containing 1/4 of the observations.

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10, so Q1 = 6

Q3: median of upper half 12 14 16 18 20, so Q3 = 16
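The quartile rules can be sketched in Python as follows. The function names `median` and `quartiles` are illustrative. Other quartile conventions exist, so spreadsheet or library routines may return slightly different values than this rule does.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = values[: half + n % 2]  # n odd: the slice includes the median
    upper = values[half:]           # n odd: the slice starts at the median
    return median(lower), median(values), median(upper)


# Even n: the example with n = 10.
q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)

# Odd n: the median 3 belongs to both halves, [1, 2, 3] and [3, 4, 5].
odd_q1, odd_m, odd_q3 = quartiles([1, 2, 3, 4, 5])
print(odd_q1, odd_m, odd_q3)
```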

Pulse Rates, n = 138

Count  Stem  Leaves
         4
   3     4   588
   9     5   001233444
  10     5   5556788899
  23     6   00011111122233333344444
  23     6   55556666667777788888888
  16     7   00000112222334444
  23     7   55555666666777888888999
  10     8   0000112224
  10     8   5555667789
   4     9   0012
   2     9   58
   4    10   0223
        10
   1    11   1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

The IQR measures the spread of the middle 50% of the data.

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem  Leaf
   2    22   55
   4    23   57
   6    24   26
   7    25   7
  10    26   257
  12    27   59
  (4)   28   1567
  15    29   35599
  10    30   333
   7    31   45
   5    32   155
   2    33   6
   1    34   0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Data values from position 25 down to position 1:

6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

[Figure: Disease X, years until death (vertical axis 0 to 7)]

Five-number summary: min, Q1, m, Q3, max
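As a rough check, the five-number summary of the Disease X data can be computed in Python. The sketch assumes the values are years with one decimal place (0.6 up to 6.1) and uses the median-of-halves quartile rule from this section; the function names are illustrative.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) with the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[: half + n % 2]  # includes the median when n is odd
    upper = values[half:]           # starts at the median when n is odd
    return values[0], median(lower), median(values), median(upper), values[-1]


# Years until death, Disease X (n = 25).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

summary = five_number_summary(years)
print(summary)
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside the box marks the median, and the whiskers reach toward the minimum and maximum.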

Boxplot display of 5-number summary

[Figure: boxplot]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Data values from position 25 down to position 1:

7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6

Boxplot display of 5-number summary

[Figure: boxplot of Disease X, years until death (vertical axis 0 to 8)]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
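The 1.5 × IQR rule can be sketched in Python. The quartiles are taken as given (Q1 = 2.3, Q3 = 4.2) rather than recomputed, the data are assumed to be years with one decimal place, and the name `fences` is an illustrative choice.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Disease X data in which the largest value is 7.9 years.
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

low, high = fences(q1=2.3, q3=4.2)

# Values beyond a fence are flagged as outliers and plotted individually.
outliers = [y for y in years if y < low or y > high]

# Each whisker stops at the most extreme value still inside the fences.
inside = [y for y in years if low <= y <= high]
whisker_bottom, whisker_top = min(inside), max(inside)

print(f"fences: ({low:.2f}, {high:.2f})")
print(f"outliers: {outliers}")
print(f"whiskers run from {whisker_bottom} to {whisker_top}")
```

The upper whisker ends at 6.1 rather than at the fence itself, because 6.1 is the largest observation below 7.05.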

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with 40.5, 45, 63, 70, 78, and 100.5 marked]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit, e.g., the height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
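The marginal distributions come from the row and column totals divided by the grand total. A minimal Python sketch, with the table stored as a nested dictionary (an illustrative layout, not one prescribed by the slides):

```python
# Titanic counts: survival status (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

classes = list(counts["Alive"])
row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distribution = each total as a fraction of the grand total.
marginal_survival = {status: t / grand_total for status, t in row_totals.items()}
marginal_class = {c: t / grand_total for c, t in col_totals.items()}

for status, p in marginal_survival.items():
    print(f"{status:<6} {100 * p:.1f}%")
for c, p in marginal_class.items():
    print(f"{c:<6} {100 * p:.1f}%")
```

Each marginal distribution ignores the other variable entirely, which is why its percentages add to 100%.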

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201
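The "% of col" rows are conditional distributions: each count divided by its column total rather than by the grand total. A short Python sketch, again using an illustrative nested-dictionary layout for the table:

```python
# Titanic counts: survival status (rows) by class (columns).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

classes = list(counts["Alive"])
col_totals = {c: sum(counts[status][c] for status in counts) for c in classes}

# Conditional distribution of survival given class: divide by the class total.
survival_given_class = {
    c: {status: counts[status][c] / col_totals[c] for status in counts}
    for c in classes
}

for c in classes:
    alive = 100 * survival_given_class[c]["Alive"]
    dead = 100 * survival_given_class[c]["Dead"]
    print(f"{c:<6} alive {alive:.1f}%  dead {dead:.1f}%")
```

Comparing the conditional distributions across classes (62.2% of first class survived versus 25.2% of third class) is what reveals the association between class and survival.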

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                              Class
Survival             Crew    First   Second   Third   Total
Alive   Count         212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count         673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count         885      325      285     706    2201

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data: previous slides

Scatterplots and Correlation for Bivariate Quantitative Data: next

Student   Beers   Blood Alcohol
   1        5       0.1
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.1
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students: 1) how many beers they drank, and 2) their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

(Beers, BAC) for students 1 to 16: (5, 0.1), (2, 0.03), (9, 0.19), (7, 0.095), (3, 0.07), (3, 0.02), (4, 0.07), (5, 0.085), (8, 0.12), (3, 0.04), (5, 0.06), (5, 0.05), (6, 0.1), (7, 0.09), (1, 0.01), (4, 0.05)

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

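Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]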

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

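Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$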
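The formula for r translates almost term by term into Python. The sketch below assumes the car data are weights in 1000 lbs and fuel consumption in gal/100 miles with one decimal place (for example, 3.4 and 5.5 for the first car); the function name `correlation` is an illustrative choice.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of standardized x and y values,
    using n - 1 in the denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so r has no units and does not change if weight is recorded in kilograms or fuel consumption in liters.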

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3


Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>

z-score corresponding to y

$$z = \frac{y - \bar{y}}{s}$$

where

$y$ = original data value

$\bar{y}$ = the sample mean

$s$ = the sample standard deviation

$z$ = the z-score corresponding to $y$

Exam 1: $\bar{y}_1 = 88$, $s_1 = 6$; your exam 1 score: 91

Exam 2: $\bar{y}_2 = 88$, $s_2 = 10$; your exam 2 score: 92

Which score is better?

$$z_1 = \frac{91 - 88}{6} = \frac{3}{6} = 0.5$$

$$z_2 = \frac{92 - 88}{10} = \frac{4}{10} = 0.4$$

91 on exam 1 is better than 92 on exam 2.

If data has mean $\bar{y}$ and standard deviation $s$, then standardizing a particular value of $y$ indicates how many standard deviations $y$ is above or below the mean $\bar{y}$.

Comparing SAT and ACT Scores

SAT Math: Eleanor's score 680; SAT mean = 500, sd = 100

ACT Math: Gerald's score 27; ACT mean = 18, sd = 6

Eleanor's z-score: z = (680 - 500)/100 = 1.8

Gerald's z-score: z = (27 - 18)/6 = 1.5

Eleanor's score is better.

Z-scores add to zero

Student/Institutional Support to Athletic Depts. for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000, s = 3.5697

Sum of (y - ybar) = 0; sum of z-scores = 0
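The two zero sums in the table can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not from the slides. Because the support figures in the table are rounded to one decimal, a recomputed z-score may differ from the table in the last digit.

```python
from statistics import mean, stdev

# Support in $ millions for the 9 public ACC schools (values from the table above)
support = {
    "Maryland": 15.5, "UVA": 13.1, "Louisville": 10.9, "UNC": 9.2,
    "VaTech": 7.9, "FSU": 7.9, "GaTech": 7.1, "NCSU": 6.5, "Clemson": 3.8,
}

values = list(support.values())
y_bar = mean(values)   # sample mean
s = stdev(values)      # sample standard deviation (divides by n - 1)

# z = (y - y_bar) / s for each school
z_scores = {school: (y - y_bar) / s for school, y in support.items()}

for school, y in support.items():
    print(f"{school:<10} {y:5.1f} {y - y_bar:6.1f} {z_scores[school]:6.2f}")

print(f"mean = {y_bar:.4f}, s = {s:.4f}")
print(f"sum of deviations = {sum(y - y_bar for y in values):.6f}")
print(f"sum of z-scores   = {sum(z_scores.values()):.6f}")
```

The deviations from the mean always sum to zero, and dividing each one by the same constant s keeps the sum at zero, which is why the z-score column also sums to zero (up to floating-point rounding).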

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1) 1.03

2) -1.03

3) 2.39

4) 1865

5) -1865

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1   1   0.6
 2   2   1.2
 3   3   1.6
 4   4   1.9
 5   5   1.5
 6   6   2.1
 7   7   2.3   (Q1)
 8   6   2.3
 9   5   2.5
10   4   2.8
11   3   2.9
12   2   3.3
13   1   3.4   (median)
14   2   3.6
15   3   3.7
16   4   3.8
17   5   3.9
18   6   4.1
19   7   4.2   (Q3)
20   6   4.5
21   5   4.7
22   4   4.9
23   3   5.3
24   2   5.6
25   1   6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2, 4, 6, 8, 10

Q1 = 6

Q3: median of upper half 12, 14, 16, 18, 20

Q3 = 16
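The quartile rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide; the function name `quartiles` is made up here, and other software may use a different quartile convention and return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the overall median is not included in either half
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print(q1, m, q3)  # prints: 6 11.0 16
```

For the 10-value example this reproduces Q1 = 6, m = 11, and Q3 = 16.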

Pulse Rates, n = 138

Count   Stem | Leaves
          4  |
   3      4  | 588
   9      5  | 001233444
  10      5  | 5556788899
  23      6  | 00011111122233333344444
  23      6  | 55556666667777788888888
  16      7  | 0000011222233444
  23      7  | 55555666666777888888999
  10      8  | 0000112224
  10      8  | 5555667789
   4      9  | 0012
   2      9  | 58
   4     10  | 0223
         10  |
   1     11  | 1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth   stem | leaf
   2     22  | 5 5
   4     23  | 5 7
   6     24  | 2 6
   7     25  | 7
  10     26  | 2 5 7
  12     27  | 5 9
  (4)    28  | 1 5 6 7
  15     29  | 3 5 5 9 9
  10     30  | 3 3 3
   7     31  | 4 5
   5     32  | 1 5 5
   2     33  | 6
   1     34  | 0

(22 | 5 represents a weight of 225)

1) 287

2) 257.5

3) 263.5

4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth   stem | leaf
   2     22  | 5 5
   4     23  | 5 7
   6     24  | 2 6
   7     25  | 7
  10     26  | 2 5 7
  12     27  | 5 9
  (4)    28  | 1 5 6 7
  15     29  | 3 5 5 9 9
  10     30  | 3 3 3
   7     31  | 4 5
   5     32  | 1 5 5
   2     33  | 6
   1     34  | 0

1) 23.5

2) 39.5

3) 46

4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

25   1   6.1   (max)
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2   (Q3)
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4   (median)
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3   (Q1)
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6   (min)

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 7; the five-number summary (min, Q1, m, Q3, max) is marked on the plot]

Boxplot display of 5-number summary

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

25   1   7.9   (max)
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2   (Q3)
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3   (Q1)
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 8]

Boxplot display of 5-number summary

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
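The outlier check above can be reproduced with a short script. This is a sketch under the slide's own conventions (quartiles as medians of the halves, fences at 1.5 IQR beyond the quartiles); the variable names are illustrative.

```python
from statistics import median

# Years until death for the 25 Disease X patients (version of the data with 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

data = sorted(years)
half = len(data) // 2

# n = 25 is odd, so the overall median is included in both halves
q1 = median(data[:half + 1])
q3 = median(data[half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are flagged as outliers
outliers = [y for y in data if y < lower_fence or y > upper_fence]

# Each whisker stops at the most extreme value that is still inside the fences
whisker_top = max(y for y in data if y <= upper_fence)
whisker_bottom = min(y for y in data if y >= lower_fence)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr:.2f}")
print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"outliers: {outliers}")
print(f"whiskers: {whisker_bottom} to {whisker_top}")
```

With these data the upper fence is 7.05, the only flagged value is 7.9, and the top whisker ends at 6.1, matching the slide.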

ATM Withdrawals by Day, Month, Holidays

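Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data, labeled with 45, 63, 70, 78 and the fences 40.5 and 100.5]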

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

          Crew   First   Second   Third   Total
Alive      212     202      118     178     710
Dead       673     123      167     528    1491
Total      885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%

Marginal distribution of class: Bar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                         Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
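The marginal and conditional percentages for the Titanic table can be recomputed from the counts alone. The sketch below uses plain Python; the dictionary layout and variable names are assumptions made for illustration.

```python
# Titanic counts: rows = survival status, columns = class
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead":  [673, 123, 167, 528],
}

row_totals = {status: sum(row) for status, row in counts.items()}
col_totals = [sum(counts[status][j] for status in counts)
              for j in range(len(classes))]
grand_total = sum(row_totals.values())

# Marginal distributions: a row or column total divided by the grand total
marginal_survival = {status: t / grand_total for status, t in row_totals.items()}
marginal_class = {c: col_totals[j] / grand_total for j, c in enumerate(classes)}

# Conditional distribution of survival given class:
# each count divided by its own column total
survival_given_class = {
    c: {status: counts[status][j] / col_totals[j] for status in counts}
    for j, c in enumerate(classes)
}

for c in classes:
    alive = survival_given_class[c]["Alive"]
    dead = survival_given_class[c]["Dead"]
    print(f"{c:<7} alive {alive:.1%}  dead {dead:.1%}")
```

The key design point is the denominator: marginal distributions divide by the grand total (2201), while the conditional distribution divides by the total of the class being conditioned on.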

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                         Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1) 41.8%

2) 38.8%

3) 51.2%

4) 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%

2) 48.8%

3) 26.8%

4) 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5        0.1
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.1
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

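Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$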
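As a check on the reported value r = .9766, the correlation formula can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot slide. This is a minimal sketch with Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, stdev

# (car weight in 1000 lbs, fuel consumption in gal/100 miles)
pairs = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
         (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]

x = [p[0] for p in pairs]
y = [p[1] for p in pairs]
n = len(pairs)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)   # sample standard deviations (divide by n - 1)

# r = 1/(n - 1) times the sum of products of standardized x and y values
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in pairs) / (n - 1)

print(f"r = {r:.4f}")  # prints: r = 0.9766
```

Each term in the sum is a product of two z-scores, so r is positive when cars that are heavier than average also tend to use more fuel than average.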

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 96: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Exam 1 y1 = 88 s1 = 6 your exam 1 score 91

Exam 2 y2 = 88 s2 = 10 your exam 2 score 92

Which score is better

1

2

91 88 3z 5

6 692 88 4

z 410 10

91 on exam 1 is better than 92 on exam 2

If data has mean and standard deviation

then standardizing a particular value of

indicates how many standard deviations

is above or below the mean

y s

y

y

y

Comparing SAT and ACT Scores

SAT Math Eleanorrsquos score 680

SAT mean =500 sd=100 ACT Math Geraldrsquos score 27

ACT mean=18 sd=6 Eleanorrsquos z-score z=(680-500)100=18 Geraldrsquos z-score z=(27-18)6=15 Eleanorrsquos score is better

Z-scores add to zeroStudentInstitutional Support to Athletic Depts For the 9 Public ACC

Schools 2013 ($ millions)

School Support y - ybar Z-score

Maryland 155 64 179

UVA 131 40 112

Louisville 109 18 050

UNC 92 01 003

VaTech 79 -12 -034

FSU 79 -12 -034

GaTech 71 -20 -056

NCSU 65 -26 -073

Clemson 38 -53 -147

Mean=91000 s=35697

Sum = 0 Sum = 0

Recently the mean tuition at 4-yr public collegesuniversities in the US was $6185 with a standard deviation of $1804 In NC the mean tuition was $4320 What is NCrsquos z-score

1 103

2 -103

3 239

4 1865

5 -1865

Section 34Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 34

Q1= first quartile = 23

Q3= third quartile = 42

1 1 062 2 123 3 164 4 195 5 156 6 217 7 238 6 239 5 2510 4 2811 3 2912 2 3313 1 3414 2 3615 3 3716 4 3817 5 3918 6 4119 7 4220 6 4521 5 4722 4 4923 3 5324 2 5625 1 61

Quartiles Measuring spread by examining the middleThe first quartile Q1 is the value in the

sample that has 25 of the data at or

below it (Q1 is the median of the lower

half of the sorted data)

The third quartile Q3 is the value in the

sample that has 75 of the data at or

below it (Q3 is the median of the upper

half of the sorted data)

Quartiles and median divide data into 4 pieces

Q1 M Q3

14 14 14 14

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
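The quartile rules above can be sketched as a small Python function. This is an illustration of the slide's median-of-halves rule only; the function name `quartiles` is an assumption, and other software often uses different quartile conventions that give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither half.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        lower = values[: half + 1]  # includes the middle value
        upper = values[half:]       # includes the middle value
    else:
        lower = values[:half]
        upper = values[half:]
    return median(lower), median(values), median(upper)

# The n = 10 example from the slide
q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print("Q1 =", q1, " median =", m, " Q3 =", q3)
```

For the slide's example the function reproduces Q1 = 6, median = 11 and Q3 = 16.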

Pulse Rates n = 138

Count   Stem   Leaves
          4
  3       4    588
  9       5    001233444
 10       5    5556788899
 23       6    00011111122233333344444
 23       6    55556666667777788888888
 16       7    0000011222233444
 23       7    55555666666777888888999
 10       8    0000112224
 10       8    5555667789
  4       9    0012
  2       9    58
  4      10    0223
         10
  1      11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth   stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth   stem | leaf
  2      22 | 55
  4      23 | 57
  6      24 | 26
  7      25 | 7
 10      26 | 257
 12      27 | 59
 (4)     28 | 1567
 15      29 | 35599
 10      30 | 333
  7      31 | 45
  5      32 | 155
  2      33 | 6
  1      34 | 0

1. 23.5

2. 39.5

3. 46

4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111
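As a check on the pulse-data summary, the sketch below rebuilds the 138 pulse rates from the stem-and-leaf display shown earlier and computes the five numbers with the same median-of-halves rule. The row transcription and the helper name `five_number_summary` are assumptions made for illustration.

```python
from statistics import median

# (stem, leaves) rows read from the pulse-rate stem-and-leaf display;
# each value is 10 * stem + leaf
stem_leaf = [
    (4, "588"),
    (5, "001233444"),
    (5, "5556788899"),
    (6, "00011111122233333344444"),
    (6, "55556666667777788888888"),
    (7, "0000011222233444"),
    (7, "55555666666777888888999"),
    (8, "0000112224"),
    (8, "5555667789"),
    (9, "0012"),
    (9, "58"),
    (10, "0223"),
    (11, "1"),
]
pulses = sorted(10 * stem + int(leaf) for stem, leaves in stem_leaf for leaf in leaves)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves quartile rule."""
    values = sorted(values)
    n = len(values)
    half = n // 2
    lower = values[: half + (n % 2)]  # include the overall median only when n is odd
    upper = values[half:]
    return values[0], median(lower), median(values), median(upper), values[-1]

print("n =", len(pulses))
print("five-number summary:", five_number_summary(pulses))
```

With n = 138 (even), the lower and upper halves each hold 69 values, so Q1 and Q3 are the 35th values counted from the low and high ends, as on the slide.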

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
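The 1.5 × IQR outlier check above can be sketched as follows. The data are the 25 Disease X values with decimal points restored, and Q1 and Q3 are taken as given on the slide; the variable names (`lower_fence`, `upper_fence`, `whisker_high`) are illustrative, not terms from the slides.

```python
# Years until death for the 25 Disease X patients (largest value 7.9)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2  # quartiles as given on the slide
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are flagged as outliers; whiskers stop at the
# most extreme values that are still inside the fences
outliers = [y for y in years if y < lower_fence or y > upper_fence]
inside = [y for y in years if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f})")
print("outliers:", outliers)
print("whiskers end at", whisker_low, "and", whisker_high)
```

Only 7.9 lies above the upper fence of 7.05, so the top whisker ends at 6.1, the largest value inside the fence.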

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot labels: 45 (min), 40.5, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Boxplot: Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
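The marginal distributions above come from dividing the row totals and column totals by the grand total. A minimal Python sketch, with illustrative variable names:

```python
# Titanic counts from the contingency table above
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals / grand total
survival_marginal = {status: sum(row) / grand_total for status, row in counts.items()}

# Marginal distribution of class: column totals / grand total
class_marginal = {
    cls: sum(row[j] for row in counts.values()) / grand_total
    for j, cls in enumerate(classes)
}

for name, p in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{100 * p:5.1f}%")
```

Each marginal distribution ignores the other variable entirely, which is why its proportions sum to 1 on their own.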

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
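The "% of col" rows are conditional distributions of survival given class: each count is divided by its own column total rather than by the grand total. A short sketch under that reading, with illustrative names:

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

# Conditional distribution of survival given class:
# divide each count by its column (class) total
pct_alive = {cls: 100 * a / (a + d) for cls, a, d in zip(classes, alive, dead)}
pct_dead = {cls: 100 * d / (a + d) for cls, a, d in zip(classes, alive, dead)}

for cls in classes:
    print(f"{cls:<7} alive {pct_alive[cls]:5.1f}%   dead {pct_dead[cls]:5.1f}%")
```

Comparing the columns shows how strongly survival depended on class, from about 62% in first class down to about 24% for the crew.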

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                              Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
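The three questions share the numerator 202 and differ only in the denominator, that is, in the group being conditioned on. A small sketch making that explicit; the variable names are illustrative:

```python
first_alive = 202   # first-class passengers who survived
survivors = 710     # all survivors
first_total = 325   # all first-class passengers
everyone = 2201     # everyone on board

share_of_survivors_in_first = first_alive / survivors    # conditional on surviving
share_first_and_survived = first_alive / everyone        # joint proportion
share_of_first_who_survived = first_alive / first_total  # conditional on first class

print(f"Of survivors, in first class:   {share_of_survivors_in_first:.3f}")
print(f"First class and survived:       {share_first_and_survived:.3f}")
print(f"Of first class, who survived:   {share_of_first_who_survived:.3f}")
```

Reading the question carefully to identify the denominator is the whole difficulty; the arithmetic itself is a single division.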

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5        0.10
   2        2        0.03
   3        9        0.19
   4        7        0.095
   5        3        0.07
   6        3        0.02
   7        4        0.07
   8        5        0.085
   9        8        0.12
  10        3        0.04
  11        5        0.06
  12        5        0.05
  13        6        0.10
  14        7        0.09
  15        1        0.01
  16        4        0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

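Correlation: Fuel Consumption vs Car Weight

[Figure: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP (gal/100 miles), 2 to 7]

r = .9766

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$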
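The value r = .9766 can be reproduced from the ten (weight, fuel consumption) pairs listed with the scatterplot. The sketch below applies the slide's formula directly, averaging the products of standardized values with an n - 1 divisor. The decimal placement in the data is a reading of the slide, and the function name `correlation` is illustrative.

```python
from statistics import mean, stdev

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # x, in 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # y, in gal/100 miles

def correlation(x, y):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because each factor is a z-score, r has no units: rescaling weight to kilograms or fuel consumption to litres would leave it unchanged.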

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Z-scores add to zero

Student/Institutional Support to Athletic Depts for the 9 Public ACC Schools, 2013 ($ millions)

School       Support   y - ybar   Z-score
Maryland       15.5       6.4       1.79
UVA            13.1       4.0       1.12
Louisville     10.9       1.8       0.50
UNC             9.2       0.1       0.03
VaTech          7.9      -1.2      -0.34
FSU             7.9      -1.2      -0.34
GaTech          7.1      -2.0      -0.56
NCSU            6.5      -2.6      -0.73
Clemson         3.8      -5.3      -1.47

Mean = 9.1000   s = 3.5697

Sum of (y - ybar) = 0   Sum of Z-scores = 0

Recently the mean tuition at 4-yr public colleges/universities in the US was $6185 with a standard deviation of $1804. In NC the mean tuition was $4320. What is NC's z-score?

1. 1.03

2. -1.03

3. 2.39

4. 1865

5. -1865
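One way to work this question is to apply the z-score definition directly. A minimal sketch; the function and variable names are illustrative and not part of the slides.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

us_mean, us_sd = 6185, 1804   # figures quoted in the question
nc_tuition = 4320

z = z_score(nc_tuition, us_mean, us_sd)
print(round(z, 2))
```

The sign matters: NC's tuition is below the national mean, so its z-score is negative.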

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20, n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16
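The rule above can be sketched in Python as follows. The function name `quartiles` is made up for illustration; it is not a library call.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1:
        lower = values[: half + 1]   # includes the middle value
        upper = values[half:]        # includes the middle value
    else:
        lower = values[:half]
        upper = values[half:]
    return median(lower), median(values), median(upper)

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
q1, m, q3 = quartiles(example)
print(q1, m, q3)
```

Statistical packages and spreadsheets often use other quartile conventions, so their answers can differ slightly from this rule on the same data.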

Pulse Rates n = 138

Count  Stem  Leaves
          4
    3     4  588
    9     5  001233444
   10     5  5556788899
   23     6  00011111122233333344444
   23     6  55556666667777788888888
   16     7  00000112222334444
   23     7  55555666666777888888999
   10     8  0000112224
   10     8  5555667789
    4     9  0012
    2     9  58
    4    10  0223
         10
    1    11  1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287

2. 257.5

3. 263.5

4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1
middle quartile: median
upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5

2. 39.5

3. 46

4. 69.5
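The linemen question can be worked by expanding the stem-and-leaf display back into individual weights. This sketch assumes each stem is the first two digits of a weight and each leaf is the last digit (so stem 22 with leaf 5 is 225), and it handles only the odd-n case that applies here.

```python
from statistics import median

# (stem, leaves) rows of the display; a weight is stem * 10 + leaf.
rows = [
    (22, "55"), (23, "57"), (24, "26"), (25, "7"), (26, "257"),
    (27, "59"), (28, "1567"), (29, "35599"), (30, "333"),
    (31, "45"), (32, "155"), (33, "6"), (34, "0"),
]
weights = sorted(stem * 10 + int(leaf) for stem, leaves in rows for leaf in leaves)

n = len(weights)               # number of linemen; odd for this display
mid = n // 2                   # index of the overall median
lower = weights[: mid + 1]     # n odd: the median belongs to both halves
upper = weights[mid:]

q1, q3 = median(lower), median(upper)
iqr = q3 - q1
print(n, median(weights), q1, q3, iqr)
```

Because n is odd, each half holds 16 weights, so each quartile is the average of the 8th and 9th values in its half.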

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: Disease X; vertical axis: Years until death, 0 to 7]

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Largest = max = 7.9

Boxplot display of 5-number summary

BOXPLOT

[Figure: Disease X; vertical axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
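The outlier check above can be sketched in Python. The data values are read from the slide with decimal points assumed lost in extraction (7.9 rather than 79), the quartiles are taken as stated on the slide, and only the upper fence is computed because that is the one the slide uses.

```python
# Years until death for the 25 individuals (decimal placement assumed).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2              # quartiles stated on the slide
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers

outliers = [y for y in years if y > upper_fence]
# The upper whisker stops at the largest value that is not an outlier.
whisker_top = max(y for y in years if y <= upper_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
```

The results are rounded for display because 4.2 - 2.3 is not stored exactly in binary floating point.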

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data, marked at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
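The marginal distributions come from the row and column totals of the table. A minimal sketch, with the counts copied from the Titanic table and the variable names chosen for illustration:

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

# Row totals give the survival margin; column totals give the class margin.
row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
grand_total = sum(row_totals.values())

survival_marginal = {s: 100 * t / grand_total for s, t in row_totals.items()}
class_marginal = {c: 100 * t / grand_total for c, t in col_totals.items()}

for name, pct in {**survival_marginal, **class_marginal}.items():
    print(f"{name:<7}{pct:5.1f}%")
```

Each marginal distribution uses the grand total (2201) as its denominator, so its percentages add to 100%.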

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions
Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201
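The "% of col" rows divide each count by its own column total rather than by the grand total. A short sketch of that calculation, with illustrative variable names:

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Conditional distribution of survival given class:
# the denominator is the number of people in that class.
pct_alive_given_class = {
    c: 100 * alive[c] / (alive[c] + dead[c]) for c in alive
}

for c, pct in pct_alive_given_class.items():
    print(f"{c:<7} alive {pct:4.1f}%   dead {100 - pct:4.1f}%")
```

Within each class the alive and dead percentages add to 100%, which is what makes each column a distribution in its own right.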

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                          Class
Survival              Crew    First   Second   Third   Total
Alive   Count          212      202      118     178     710
        % of col     24.0%    62.2%    41.4%   25.2%   32.3%
Dead    Count          673      123      167     528    1491
        % of col     76.0%    37.8%    58.6%   74.8%   67.7%
Total   Count          885      325      285     706    2201

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
  1        5      0.1
  2        2      0.03
  3        9      0.19
  4        7      0.095
  5        3      0.07
  6        3      0.02
  7        4      0.07
  8        5      0.085
  9        8      0.12
 10        3      0.04
 11        5      0.06
 12        5      0.05
 13        6      0.1
 14        7      0.09
 15        1      0.01
 16        4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel cons.

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

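Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$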
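The value of r for the fuel consumption data can be reproduced by averaging the products of standardized x and y values, dividing by n - 1. This is a sketch: the data pairs are read from the scatterplot slide with decimal points assumed lost in extraction (3.4 rather than 34), and `correlation` is an illustrative helper, not a library function.

```python
from statistics import mean, stdev

# Car weight (1000 lbs) and fuel consumption (gal/100 miles); decimals assumed.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def correlation(x, y):
    """r = (1 / (n - 1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

r = correlation(weight, fuel)
print(round(r, 4))
```

Because each variable is standardized first, r has no units and does not change if weight is recorded in kilograms or fuel consumption in litres.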

Properties

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength: how closely the points follow a straight line

Direction: is positive when individuals with higher X values tend to have higher values of Y

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage!)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = .935

End of Chapter 3

  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Section 3.4 Measures of Position (also called Measures of Relative Standing)

Quartiles

5-Number Summary

Interquartile Range Another Measure of Spread

Boxplots

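m = median = 3.4

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Disease X: years until death for 25 individuals (position in sorted data, position within half, value)

 1  1  0.6
 2  2  1.2
 3  3  1.6
 4  4  1.9
 5  5  1.5
 6  6  2.1
 7  7  2.3
 8  6  2.3
 9  5  2.5
10  4  2.8
11  3  2.9
12  2  3.3
13  1  3.4
14  2  3.6
15  3  3.7
16  4  3.8
17  5  3.9
18  6  4.1
19  7  4.2
20  6  4.5
21  5  4.7
22  4  4.9
23  3  5.3
24  2  5.6
25  1  6.1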

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Quartiles and median divide data into 4 pieces

Q1   M   Q3

1/4   1/4   1/4   1/4

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.

Example: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 (n = 10)

Median: m = (10 + 12)/2 = 22/2 = 11

Q1: median of lower half 2, 4, 6, 8, 10

Q1 = 6

Q3: median of upper half 12, 14, 16, 18, 20

Q3 = 16
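The rule for calculating quartiles can be sketched in a few lines of Python. This is only an illustration of the steps on the slide; the function names `median` and `quartiles` are chosen here and do not come from the course materials.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

Statistical software often uses a different quartile convention, so its results can differ slightly from this rule.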

Pulse Rates n = 138

Count  Stem  Leaves
       4
3      4     588
9      5     001233444
10     5     5556788899
23     6     00011111122233333344444
23     6     55556666667777788888888
16     7     00000112222334444
23     7     55555666666777888888999
10     8     0000112224
10     8     5555667789
4      9     0012
2      9     58
4      10    0223
       10
1      11    1

Median: mean of pulses in locations 69 & 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
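When reading quartiles off a stem-and-leaf display, it helps to know which ordered positions to count to. The sketch below derives those positions from n alone, following the rule for odd and even n; the function names are illustrative and not part of the slides.

```python
def median_positions(n):
    """1-based position(s) of the median in a sorted list of n values."""
    if n % 2 == 1:
        return ((n + 1) // 2,)
    return (n // 2, n // 2 + 1)


def quartile_positions(n):
    """Positions (counted from the low end) of the median, Q1 and Q3."""
    # Each half has (n + 1) // 2 values: for odd n the overall median
    # is counted in both halves, for even n each half has n / 2 values.
    half = (n + 1) // 2
    q1 = median_positions(half)
    # Q3 is as far from the high end as Q1 is from the low end
    q3 = tuple(n + 1 - p for p in reversed(q1))
    return {"median": median_positions(n), "Q1": q1, "Q3": q3}


print(quartile_positions(138))
# {'median': (69, 70), 'Q1': (35,), 'Q3': (104,)}
```

For n = 138, position 104 from the low end is position 35 from the high end, which matches the pulse-rate slide. Two positions in a tuple mean the value is the average of those two observations.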

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

       stem  leaf
2      22    55
4      23    57
6      24    26
7      25    7
10     26    257
12     27    59
(4)    28    1567
15     29    35599
10     30    333
7      31    45
5      32    155
2      33    6
1      34    0

1) 287
2) 257.5
3) 263.5
4) 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures spread of middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

       stem  leaf
2      22    55
4      23    57
6      24    26
7      25    7
10     26    257
12     27    59
(4)    28    1567
15     29    35599
10     30    333
7      31    45
5      32    155
2      33    6
1      34    0

1) 23.5
2) 39.5
3) 46
4) 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Disease X: years until death (position in sorted data, position within half, value)

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: boxplot of the Disease X data, labeled "BOXPLOT"; vertical axis: years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Disease X: years until death (position in sorted data, position within half, value)

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

Boxplot: display of 5-number summary

[Figure: boxplot of the Disease X data, labeled "BOXPLOT"; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
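The outlier check on this slide can be written as a short Python sketch. The data are the 25 Disease X values in years (0.6 up to 7.9), and Q1 = 2.3 and Q3 = 4.2 are taken as given from the slide. The function name is hypothetical, and counting a value exactly on the fence as "not an outlier" is an assumption made here.

```python
def upper_fence_check(values, q1, q3):
    """Return (upper fence, end of upper whisker, list of high outliers)."""
    iqr = q3 - q1
    fence = q3 + 1.5 * iqr
    # Values beyond the fence are flagged as outliers
    outliers = [v for v in values if v > fence]
    # The whisker stops at the largest value that is not beyond the fence
    whisker_end = max(v for v in values if v <= fence)
    return fence, whisker_end, outliers


years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

fence, whisker_end, outliers = upper_fence_check(years, q1=2.3, q3=4.2)
print(round(fence, 2), whisker_end, outliers)  # 7.05 6.1 [7.9]
```

The same idea applies at the low end with Q1 - 1.5 IQR.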

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates with labeled values 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450
2) 750
3) 215
4) 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
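The marginal distributions can be checked with a small Python sketch that stores the Titanic counts in a nested dictionary. The variable names are chosen for this illustration only.

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

# Grand total of the table
total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total
survival_marginal = {s: sum(row.values()) / total for s, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total
classes = list(counts["Alive"])
class_marginal = {c: sum(counts[s][c] for s in counts) / total for c in classes}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```

Each marginal distribution sums to 100% because it accounts for every one of the 2201 people exactly once.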

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                             Class
Survival              Crew   First  Second  Third  Total
Alive   Count          212     202     118    178    710
        % of col     24.0%   62.2%   41.4%  25.2%  32.3%
Dead    Count          673     123     167    528   1491
        % of col     76.0%   37.8%   58.6%  74.8%  67.7%
Total   Count          885     325     285    706   2201
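The "% of col" rows are conditional distributions of survival given class: each count is divided by its column total rather than by the grand total. A minimal Python sketch, with an illustrative function name:

```python
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def column_percentages(table):
    """Conditional distribution of the row variable within each column."""
    columns = list(next(iter(table.values())))
    result = {}
    for col in columns:
        col_total = sum(table[row][col] for row in table)
        result[col] = {row: 100 * table[row][col] / col_total for row in table}
    return result


pct = column_percentages(counts)
for cls, dist in pct.items():
    print(cls, {k: round(v, 1) for k, v in dist.items()})
```

Within each class the two percentages add to 100%, which is what makes them a distribution.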

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                             Class
Survival              Crew   First  Second  Third  Total
Alive   Count          212     202     118    178    710
        % of col     24.0%   62.2%   41.4%  25.2%  32.3%
Dead    Count          673     123     167    528   1491
        % of col     76.0%   37.8%   58.6%  74.8%  67.7%
Total   Count          885     325     285    706   2201
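All three answers share the numerator 202 (first-class survivors); only the denominator changes with the group being described. A short Python sketch, with variable names chosen here for illustration, makes the contrast explicit:

```python
first_class_survivors = 202
all_survivors = 710        # row total for "Alive"
everyone_on_board = 2201   # grand total
all_first_class = 325      # column total for "First"

# Conditional on surviving: share of survivors who were in first class
of_survivors = first_class_survivors / all_survivors
# Joint: share of everyone who was both first class and a survivor
of_everyone = first_class_survivors / everyone_on_board
# Conditional on class: share of first-class passengers who survived
of_first_class = first_class_survivors / all_first_class

print(f"{of_survivors:.3f} {of_everyone:.3f} {of_first_class:.3f}")
# 0.285 0.092 0.622
```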

TV viewers during the Super Bowl in 2013. What is the marginal distribution of those who watched the commercials only?

1) 8.0%
2) 23.5%
3) 58.2%
4) 27.7%

TV viewers during the Super Bowl in 2013. What percentage watched the game and were female?

1) 41.8%
2) 38.8%
3) 51.2%
4) 19.8%

TV viewers during the Super Bowl in 2013. Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%
2) 48.8%
3) 26.8%
4) 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot titled "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

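Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot titled "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$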
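The formula for r can be applied directly to the ten (weight, fuel consumption) pairs from the scatterplot, with weights in 1000 lbs and fuel consumption in gal/100 miles. The following Python sketch is a plain transcription of the formula, using sample standard deviations with divisor n - 1; the function name `correlation` is chosen for this illustration.

```python
import math


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

Swapping the roles of x and y, or changing the units of either variable, leaves r unchanged because the formula uses only standardized values.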

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of points (vertical axis, 0 to 80) against fouls (horizontal axis, 0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations ...
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the men's weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 102: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

Quartiles and median divide data into 4 pieces

[Figure: the ordered data split at Q1, M (the median), and Q3 into 4 pieces, each containing 1/4 of the data]

Quartiles are common measures of spread

httpoirpncsueduiradmit

httpoirpncsueduunivpeer

University of Southern California

Economic Value of College Majors

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).

Step 2a: find the median of the lower half; this median is Q1.

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median: m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10

Q1 = 6

Q3: median of upper half 12 14 16 18 20

Q3 = 16
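The quartile rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide; the function names `median` and `quartiles` are made up for this sketch, and statistical software often uses slightly different quartile conventions.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule.

    n odd:  the overall median is included in both halves.
    n even: the overall median is included in neither half.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # number of values in each half
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(data), median(upper)


# The slide's example: Q1 = 6, median = 11, Q3 = 16
print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))
```

For an odd number of values, for example 1 2 3 4 5, the same function puts the median 3 in both halves and returns Q1 = 2 and Q3 = 4.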

Pulse Rates n = 138

Count  Stem  Leaves
       4
3      4     588
9      5     001233444
10     5     5556788899
23     6     00011111122233333344444
23     6     55556666667777788888888
16     7     00000112222334444
23     7     55555666666777888888999
10     8     0000112224
10     8     5555667789
4      9     0012
2      9     58
4      10    0223
       10
1      11    1

Median: mean of the pulses in ordered positions 69 & 70; median = (70+70)/2 = 70

Q1: median of the lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of the upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

Depth  Stem | Leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR):

IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

Depth  Stem | Leaf
2      22 | 55
4      23 | 57
6      24 | 26
7      25 | 7
10     26 | 257
12     27 | 59
(4)    28 | 1567
15     29 | 35599
10     30 | 333
7      31 | 45
5      32 | 155
2      33 | 6
1      34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5
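One way to check an answer to the linemen question is to expand the stem-and-leaf display back into the 31 weights and apply the odd-n quartile rule. The sketch below assumes each stem holds the hundreds and tens digits and each leaf the ones digit; the names `STEM_LEAF`, `expand`, and `median` are illustrative only.

```python
# Weights transcribed from the stem-and-leaf display:
# stem = hundreds and tens digits, each leaf = ones digit.
STEM_LEAF = {
    22: "55", 23: "57", 24: "26", 25: "7", 26: "257", 27: "59",
    28: "1567", 29: "35599", 30: "333", 31: "45", 32: "155",
    33: "6", 34: "0",
}


def expand(stem_leaf):
    """Turn stems and leaf strings back into a sorted list of values."""
    return sorted(10 * stem + int(leaf)
                  for stem, leaves in stem_leaf.items()
                  for leaf in leaves)


def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


weights = expand(STEM_LEAF)
n = len(weights)                 # 31 values, so the odd-n rule applies
half = (n + 1) // 2              # 16 values per half, median in both
q1 = median(weights[:half])      # 263.5
q3 = median(weights[n - half:])  # 303.0
iqr = q3 - q1                    # 39.5
print(n, q1, q3, iqr)
```

Under these assumptions Q1 agrees with the 263.5 stated in the question, and the IQR comes out to 303 - 263.5 = 39.5.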

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Individual  Count  Years until death
25          1      6.1
24          2      5.6
23          3      5.3
22          4      4.9
21          5      4.7
20          6      4.5
19          7      4.2
18          6      4.1
17          5      3.9
16          4      3.8
15          3      3.7
14          2      3.6
13          1      3.4
12          2      3.3
11          3      2.9
10          4      2.8
9           5      2.5
8           6      2.3
7           7      2.3
6           6      2.1
5           5      1.5
4           4      1.9
3           3      1.6
2           2      1.2
1           1      0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot of years until death for Disease X, vertical axis from 0 to 7]

Five-number summary:

min Q1 m Q3 max

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Individual  Count  Years until death
25          1      7.9
24          2      6.1
23          3      5.3
22          4      4.9
21          5      4.7
20          6      4.5
19          7      4.2
18          6      4.1
17          5      3.9
16          4      3.8
15          3      3.7
14          2      3.6
13          1      3.4
12          2      3.3
11          3      2.9
10          4      2.8
9           5      2.5
8           6      2.3
7           7      2.3
6           6      2.1
5           5      1.5
4           4      1.9
3           3      1.6
2           2      1.2
1           1      0.6

Largest = max = 7.9

[Figure: boxplot of years until death for Disease X, vertical axis from 0 to 8, with the largest value plotted separately as an outlier]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 × IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is above 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
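The outlier check for the Disease X data can be sketched as follows. The quartiles are taken as given on the slide rather than recomputed, the decimal points in the data are restored on the assumption that the values are years with one decimal place, and the variable names are illustrative only.

```python
# Years until death for individuals 1..25 (largest value is 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.3, 4.2                  # quartiles as given on the slide
iqr = q3 - q1                      # about 1.9
lower_fence = q1 - 1.5 * iqr       # about -0.55
upper_fence = q3 + 1.5 * iqr       # about 7.05

# Values beyond a fence are flagged as outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

# Each whisker stops at the most extreme value still inside the fences.
inside = [y for y in years if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(outliers, whisker_low, whisker_high)
```

With these numbers only 7.9 is flagged, and the upper whisker ends at 6.1, the biggest value below 7.05.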

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n=138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, marked at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew  First  Second  Third  Total
Alive     212    202     118    178    710
Dead      673    123     167    528   1491
Total     885    325     285    706   2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
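The marginal percentages above are row totals and column totals divided by the grand total of 2201. A minimal sketch of that arithmetic, with variable names chosen only for illustration:

```python
# Titanic counts: rows are survival status, columns are class.
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())   # 2201

# Marginal distribution of survival: row total / grand total.
survival_marginal = {
    status: round(100 * sum(row) / grand_total, 1)
    for status, row in counts.items()
}

# Marginal distribution of class: column total / grand total.
class_marginal = {
    cls: round(100 * sum(row[j] for row in counts.values()) / grand_total, 1)
    for j, cls in enumerate(classes)
}

print(survival_marginal)
print(class_marginal)
```

Each marginal distribution ignores the other variable entirely, which is why both are computed from the table's totals rather than from individual cells.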

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                        Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201
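The "% of col" rows can be reproduced by dividing each count by its own column total, that is, by conditioning on class. A minimal sketch, with names chosen only for illustration:

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

# Condition on class: divide each count by its column total.
survival_given_class = {}
for cls, a, d in zip(classes, alive, dead):
    col_total = a + d
    survival_given_class[cls] = {
        "Alive": round(100 * a / col_total, 1),
        "Dead": round(100 * d / col_total, 1),
    }

for cls, dist in survival_given_class.items():
    print(cls, dist)
```

Each class gets its own distribution that adds to 100%, which is what makes the survival rates comparable across classes of very different sizes.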

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                        Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201
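All three questions use the same count, 202 first-class survivors, and differ only in the total it is divided by. The sketch below spells out the three denominators; the variable names are illustrative only.

```python
first_class_survivors = 202
all_survivors = 710          # row total for Alive
all_first_class = 325        # column total for First
everyone = 2201              # grand total

# Conditional on surviving: what share of survivors were first class?
share_of_survivors = first_class_survivors / all_survivors

# Joint: what share of everyone was both first class and a survivor?
share_of_everyone = first_class_survivors / everyone

# Conditional on first class: what share of first class survived?
share_of_first_class = first_class_survivors / all_first_class

for label, value in [("of survivors", share_of_survivors),
                     ("of everyone", share_of_everyone),
                     ("of first class", share_of_first_class)]:
    print(f"{label}: {100 * value:.1f}%")
```

The wording of the question decides the denominator: "of survivors" points to the row total, "of first class passengers" to the column total, and "of passengers" to the grand total.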

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

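Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$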
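The value r = .9766 can be reproduced from the ten (weight, fuel consumption) pairs with the formula for r: standardize each x and y, multiply the pairs of standardized values, add them up, and divide by n - 1. The sketch below assumes the pairs are in units of 1000 lbs and gal/100 miles with one decimal place; the helper names are illustrative only.

```python
from math import sqrt

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, gal/100 miles


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = 1/(n-1) * sum of products of the standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)


r = correlation(weight, fuel)
print(round(r, 4))   # 0.9766
```

Because both variables are standardized before multiplying, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.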

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) against Fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935

End of Chapter 3

  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

Rules for Calculating Quartiles

Step 1: Find the median of all the data (the median divides the data in half).

Step 2a: Find the median of the lower half; this median is Q1.
Step 2b: Find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example: 2 4 6 8 10 12 14 16 18 20 (n = 10)

Median m = (10 + 12)/2 = 22/2 = 11

Q1 = median of lower half (2 4 6 8 10) = 6

Q3 = median of upper half (12 14 16 18 20) = 16
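The quartile rule above can be sketched in Python. This is only an illustration, not part of the original slides: the function name `quartiles` is hypothetical, and the sketch relies on `statistics.median` from the standard library.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the rule stated on the slide."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median is included in both halves
        lower, upper = xs[:half + 1], xs[half:]
    else:
        # n even: the overall median is in neither half
        lower, upper = xs[:half], xs[half:]
    return median(lower), median(xs), median(upper)

q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)  # 6 11.0 16
```

Other textbooks and software packages use slightly different quartile conventions, so their results can differ from this rule on small data sets.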

Pulse Rates n = 138

      Stem  Leaves
        4
  3     4   588
  9     5   001233444
 10     5   5556788899
 23     6   00011111122233333344444
 23     6   55556666667777788888888
 16     7   00000112222334444
 23     7   55555666666777888888999
 10     8   0000112224
 10     8   5555667789
  4     9   0012
  2     9   58
  4    10   0223
       10
  1    11   1

Median: mean of the pulses in ordered positions 69 and 70; median = (70 + 70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

      Stem  Leaf
   2   22   55
   4   23   57
   6   24   26
   7   25   7
  10   26   257
  12   27   59
 (4)   28   1567
  15   29   35599
  10   30   333
   7   31   45
   5   32   155
   2   33   6
   1   34   0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR): IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

      Stem  Leaf
   2   22   55
   4   23   57
   6   24   26
   7   25   7
  10   26   257
  12   27   59
 (4)   28   1567
  15   29   35599
  10   30   333
   7   31   45
   5   32   155
   2   33   6
   1   34   0

1. 23.5
2. 39.5
3. 46
4. 69.5
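One way to check a hand calculation for the linemen weights is to rebuild the data from the stem-and-leaf display above and apply the quartile rule for odd n. The sketch below is an illustration only; the variable names are hypothetical, and it assumes each weight is 10 × stem + leaf, in pounds.

```python
from statistics import median

# (stem, leaves) rows read from the display; weight = 10 * stem + leaf
rows = [(22, "55"), (23, "57"), (24, "26"), (25, "7"), (26, "257"),
        (27, "59"), (28, "1567"), (29, "35599"), (30, "333"),
        (31, "45"), (32, "155"), (33, "6"), (34, "0")]
weights = sorted(10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves)

n = len(weights)                 # 31 linemen, so n is odd
mid = n // 2                     # zero-based index of the overall median
q1 = median(weights[:mid + 1])   # lower half includes the overall median
q3 = median(weights[mid:])       # upper half includes the overall median
iqr = q3 - q1
print(n, weights[mid], q1, q3, iqr)
```

The computed Q1 agrees with the value of 263.5 stated in the question.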

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45 63 70 78 111

Five-number summary: min, Q1, m, Q3, max

Smallest = min = 0.6
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Largest = max = 6.1

Disease X data (individual, position counter, years until death):

25  1  6.1
24  2  5.6
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: boxplot display of the 5-number summary for Disease X; vertical axis: years until death, 0 to 7]

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13 17 19 22 47

Q1 = first quartile = 2.3
Q3 = third quartile = 4.2
Largest = max = 7.9

Disease X data (individual, position counter, years until death):

25  1  7.9
24  2  6.1
23  3  5.3
22  4  4.9
21  5  4.7
20  6  4.5
19  7  4.2
18  6  4.1
17  5  3.9
16  4  3.8
15  3  3.7
14  2  3.6
13  1  3.4
12  2  3.3
11  3  2.9
10  4  2.8
 9  5  2.5
 8  6  2.3
 7  7  2.3
 6  6  2.1
 5  5  1.5
 4  4  1.9
 3  3  1.6
 2  2  1.2
 1  1  0.6

[Figure: boxplot display of the 5-number summary for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
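The outlier check above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data values are read from the slide with their decimal points restored, Q1 and Q3 are taken as given, and the lower fence Q1 - 1.5 IQR (not computed on this slide) is included for symmetry.

```python
# Years until death for the 25 individuals, in the order listed (1 to 25)
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
q1, q3 = 2.3, 4.2                 # quartiles as stated on the slide

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr      # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr      # values below this are flagged as outliers
outliers = [y for y in years if y < lower_fence or y > upper_fence]
# The upper whisker ends at the largest value that is not an outlier
whisker_top = max(y for y in years if y <= upper_fence)

print(round(iqr, 2), round(upper_fence, 2), outliers, whisker_top)
# 1.9 7.05 [7.9] 6.1
```

The rounding is only there to hide floating-point noise in the printed values.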

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse data, labeled at 40.5, 45, 63, 70, 78, and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
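The marginal distributions above come from dividing each row or column total by the grand total. A minimal Python sketch of that arithmetic follows; the dictionary layout and variable names are illustrative choices, not part of the slides.

```python
# Titanic counts by survival (rows) and class (columns), from the table above
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
grand_total = sum(sum(row.values()) for row in counts.values())  # 2201

# Marginal distribution of survival: row totals as a percent of the grand total
survival_marginal = {
    status: round(100 * sum(row.values()) / grand_total, 1)
    for status, row in counts.items()
}

# Marginal distribution of class: column totals as a percent of the grand total
class_marginal = {
    cls: round(100 * sum(counts[status][cls] for status in counts) / grand_total, 1)
    for cls in counts["Alive"]
}

print(survival_marginal)  # {'Alive': 32.3, 'Dead': 67.7}
print(class_marginal)     # {'Crew': 40.2, 'First': 14.8, 'Second': 12.9, 'Third': 32.1}
```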

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival               Crew  First  Second  Third  Total
Alive   Count           212    202     118    178    710
        % of col       24.0   62.2    41.4   25.2   32.3
Dead    Count           673    123     167    528   1491
        % of col       76.0   37.8    58.6   74.8   67.7
Total   Count           885    325     285    706   2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

                       Class
Survival               Crew  First  Second  Third  Total
Alive   Count           212    202     118    178    710
        % of col       24.0   62.2    41.4   25.2   32.3
Dead    Count           673    123     167    528   1491
        % of col       76.0   37.8    58.6   74.8   67.7
Total   Count           885    325     285    706   2201
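The three questions above use the same count (202 first-class survivors) with three different denominators. The Python sketch below makes that explicit and also recomputes the "% of col" row for survivors; the variable names are hypothetical and chosen only for illustration.

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())                 # 710 survivors
total_first = alive["First"] + dead["First"]      # 325 first-class passengers
grand_total = total_alive + sum(dead.values())    # 2201 people on board

# Same numerator, three different denominators
of_survivors_in_first = alive["First"] / total_alive   # 202/710
first_and_survived = alive["First"] / grand_total      # 202/2201
of_first_survived = alive["First"] / total_first       # 202/325

# Conditional distribution of survival given class (the "% of col" row)
pct_alive_by_class = {
    cls: round(100 * alive[cls] / (alive[cls] + dead[cls]), 1) for cls in alive
}
print(pct_alive_by_class)  # {'Crew': 24.0, 'First': 62.2, 'Second': 41.4, 'Third': 25.2}
```

Only the last of the three fractions is a conditional probability of survival given class; the first conditions on survival, and the second is a joint proportion.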

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
   1       5      0.10
   2       2      0.03
   3       9      0.19
   4       7      0.095
   5       3      0.07
   6       3      0.02
   7       4      0.07
   8       5      0.085
   9       8      0.12
  10       3      0.04
  11       5      0.06
  12       5      0.05
  13       6      0.10
  14       7      0.09
  15       1      0.01
  16       4      0.05

Here we have two quantitative variables for each of 16 students:
1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

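Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$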
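As a check on the value r = .9766, the correlation can be computed directly from its definition as the sum of products of standardized x and y values divided by n - 1. The sketch below is illustrative only: the function name `correlation` is hypothetical, and the data are the ten (weight, fuel consumption) pairs from the scatterplot slide with decimal points restored.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # gal/100 miles
print(round(correlation(weight, fuel), 4))  # 0.9766
```

On Python 3.10 or later, `statistics.correlation(weight, fuel)` should return the same value.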

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Pulse Rates n = 138

Count  Stem  Leaves
    3     4  588
    9     5  001233444
   10     5  5556788899
   23     6  00011111122233333344444
   23     6  55556666667777788888888
   16     7  00000112222334444
   23     7  55555666666777888888999
   10     8  0000112224
   10     8  5555667789
    4     9  0012
    2     9  58
    4    10  0223
    1    11  1

Median: mean of the pulses in ordered positions 69 & 70; median = (70 + 70)/2 = 70

Q1: median of the lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of the upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

depth  stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 287
2. 257.5
3. 263.5
4. 262.5
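
The median-of-halves rule for quartiles can be sketched in Python. This is an illustration rather than code from the slides: the function name `quartiles` is hypothetical, and it assumes that when n is odd the overall median is counted in both halves, an assumption that agrees with the Q1 = 263.5 quoted for the linemen. The weights are read off the stem-and-leaf display above.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: when n is odd, the overall median belongs to both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2        # size of each half (median included if n is odd)
    lower = data[:half]        # the `half` smallest values
    upper = data[n - half:]    # the `half` largest values
    return median(lower), median(data), median(upper)

# Weights of the 31 linemen, read from the stem-and-leaf display.
weights = [225, 225, 235, 237, 242, 246, 257, 262, 265, 267, 275, 279,
           281, 285, 286, 287, 293, 295, 295, 299, 299, 303, 303, 303,
           314, 315, 321, 325, 325, 336, 340]

q1, med, q3 = quartiles(weights)
print(q1, med, q3)
```

For an even n, such as the 138 pulse rates, the same function splits the data into two halves of n/2 values each, so Q1 is the median of the 69 smallest values.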

Interquartile range: another measure of spread

lower quartile: Q1

middle quartile: median

upper quartile: Q3

interquartile range (IQR)

IQR = Q3 - Q1

measures the spread of the middle 50% of the data

Example: beginning pulse rates

Q3 = 78, Q1 = 63

IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

depth  stem | leaf
    2    22 | 55
    4    23 | 57
    6    24 | 26
    7    25 | 7
   10    26 | 257
   12    27 | 59
  (4)    28 | 1567
   15    29 | 35599
   10    30 | 333
    7    31 | 45
    5    32 | 155
    2    33 | 6
    1    34 | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: pulse data

45, 63, 70, 78, 111

Disease X: years until death for 25 individuals

Individual  Count  Years until death
        25      1  6.1
        24      2  5.6
        23      3  5.3
        22      4  4.9
        21      5  4.7
        20      6  4.5
        19      7  4.2
        18      6  4.1
        17      5  3.9
        16      4  3.8
        15      3  3.7
        14      2  3.6
        13      1  3.4
        12      2  3.3
        11      3  2.9
        10      4  2.8
         9      5  2.5
         8      6  2.3
         7      7  2.3
         6      6  2.1
         5      5  1.5
         4      4  1.9
         3      3  1.6
         2      2  1.2
         1      1  0.6

Largest = max = 6.1

Q3 = third quartile = 4.2

m = median = 3.4

Q1 = first quartile = 2.3

Smallest = min = 0.6

[Figure: BOXPLOT for Disease X; vertical axis: years until death, 0 to 7]

Five-number summary: min, Q1, m, Q3, max

Boxplot: display of 5-number summary

Boxplot: display of 5-number summary

Example: age of 66 “crush” victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Individual  Count  Years until death
        25      1  7.9
        24      2  6.1
        23      3  5.3
        22      4  4.9
        21      5  4.7
        20      6  4.5
        19      7  4.2
        18      6  4.1
        17      5  3.9
        16      4  3.8
        15      3  3.7
        14      2  3.6
        13      1  3.4
        12      2  3.3
        11      3  2.9
        10      4  2.8
         9      5  2.5
         8      6  2.3
         7      7  2.3
         6      6  2.1
         5      5  1.5
         4      4  1.9
         3      3  1.6
         2      2  1.2
         1      1  0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

[Figure: BOXPLOT for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
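
The outlier check for the top of the boxplot can be sketched as follows. The function name `upper_whisker` and its return layout are hypothetical choices for illustration; the quartiles Q1 = 2.3 and Q3 = 4.2 are taken as given from the slide rather than recomputed, and the data are the 25 Disease X values with individual 25 at 7.9.

```python
def upper_whisker(values, q1, q3):
    """Return (upper fence, outliers above the fence, top end of the whisker)."""
    iqr = q3 - q1
    fence = q3 + 1.5 * iqr
    # Values beyond the fence are flagged as outliers.
    outliers = [v for v in values if v > fence]
    # The whisker stops at the largest value that is not beyond the fence.
    whisker_end = max(v for v in values if v <= fence)
    return fence, outliers, whisker_end

# Years until death, individuals 1 to 25 (individual 25 = 7.9).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

fence, outliers, whisker_end = upper_whisker(years, q1=2.3, q3=4.2)
print(round(fence, 2), outliers, whisker_end)
```

The lower end of the box would be handled the same way with the fence Q1 - 1.5 IQR and the smallest value not below it.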

ATM Withdrawals by Day, Month, Holidays

Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Boxplot of the pulse rates; labeled values: 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Box plot: Pass Catching Yards by Receivers; axis from 0 to 1369 yards]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel “out of the box” does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

        Crew  First  Second  Third  Total
Alive    212    202     118    178    710
Dead     673    123     167    528   1491
Total    885    325     285    706   2201

Marginal distributions

marg. dist. of survival:

710/2201 = 32.3%

1491/2201 = 67.7%

marg. dist. of class:

885/2201 = 40.2%

325/2201 = 14.8%

285/2201 = 12.9%

706/2201 = 32.1%
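
The marginal distributions can be reproduced with a short Python sketch. The dictionary layout and variable names are illustrative choices, not part of the slides; the counts come from the Titanic table above.

```python
# Counts from the Titanic table: rows are survival status, columns are class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = ["Crew", "First", "Second", "Third"]

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: each row total over the grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: each column total over the grand total.
class_marginal = {
    c: sum(counts[status][c] for status in counts) / grand_total
    for c in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```

Each marginal distribution uses only the totals in the margins of the table, which is why the individual cell counts drop out once the row and column sums are formed.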

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                            Class
Survival            Crew   First  Second  Third   Total
Alive   Count        212     202     118    178     710
        % of col   24.0%   62.2%   41.4%  25.2%   32.3%
Dead    Count        673     123     167    528    1491
        % of col   76.0%   37.8%   58.6%  74.8%   67.7%
Total   Count        885     325     285    706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:
What fraction of survivors were in first class?
What fraction of passengers were in first class and survivors?
What fraction of the first class passengers survived?

                            Class
Survival            Crew   First  Second  Third   Total
Alive   Count        212     202     118    178     710
        % of col   24.0%   62.2%   41.4%  25.2%   32.3%
Dead    Count        673     123     167    528    1491
        % of col   76.0%   37.8%   58.6%  74.8%   67.7%
Total   Count        885     325     285    706    2201

Answers, in order:

202/710

202/2201

202/325
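
The three questions differ only in which total goes in the denominator. A minimal Python sketch, with illustrative variable names and counts taken from the table above:

```python
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

survivors = sum(alive.values())               # row total for Alive
everyone = survivors + sum(dead.values())     # grand total
first_class = alive["First"] + dead["First"]  # column total for First

answers = {
    # Conditional on surviving: denominator is the number of survivors.
    "survivors who were in first class": (alive["First"], survivors),
    # Joint: denominator is everyone on board.
    "first class and survived": (alive["First"], everyone),
    # Conditional on class: denominator is the first-class total.
    "first-class passengers who survived": (alive["First"], first_class),
}

for label, (num, den) in answers.items():
    print(f"{label}: {num}/{den} = {num / den:.1%}")
```

The numerator is the same cell (202) in all three cases; only the reference group changes.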

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 80
2. 235
3. 582
4. 277

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 418
2. 388
3. 512
4. 198

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 452
2. 488
3. 268
4. 277

Section 3.5
Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
      1      5  0.1
      2      2  0.03
      3      9  0.19
      4      7  0.095
      5      3  0.07
      6      3  0.02
      7      4  0.07
      8      5  0.085
      9      8  0.12
     10      3  0.04
     11      5  0.06
     12      5  0.05
     13      6  0.1
     14      7  0.09
     15      1  0.01
     16      4  0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

Student  Beers  BAC
      1      5  0.1
      2      2  0.03
      3      9  0.19
      4      7  0.095
      5      3  0.07
      6      3  0.02
      7      4  0.07
      8      5  0.085
      9      8  0.12
     10      3  0.04
     11      5  0.06
     12      5  0.05
     13      6  0.1
     14      7  0.09
     15      1  0.01
     16      4  0.05

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]

bivariate data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

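Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = .9766

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]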
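
The formula for r can be checked against the fuel consumption data with a short Python sketch. The function name `correlation` is an illustrative choice; the computation follows the formula term by term, using the sample standard deviations (dividing by n - 1), and the data are the ten (weight, fuel consumption) pairs listed for the scatterplot.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for the 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))
```

Rounded to four places, the result should agree with the r = .9766 reported for this data set. Swapping the roles of x and y leaves r unchanged, since the formula treats the two variables symmetrically.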

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Scatterplot: Points vs Fouls; x-axis: Fouls, 0 to 30; y-axis: Points, 0 to 80]

correlation r = .935

The correlation is due to a third “lurking” variable: playing time.

End of Chapter 3




Interquartile range another measure of spread

lower quartile Q1

middle quartile median upper quartile Q3

interquartile range (IQR)

IQR = Q3 ndash Q1

measures spread of middle 50 of the data

Example beginning pulse rates

Q3 = 78 Q1 = 63

IQR = 78 ndash 63 = 15

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 3.4

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 6.1

Smallest = min = 0.6

Disease X: years until death for 25 individuals

25   1   6.1
24   2   5.6
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 7]
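The five-number summary for the Disease X data can be reproduced with a few lines of Python. This is a minimal sketch, not the deck's own procedure: the function name `five_number_summary` is hypothetical, and it assumes the quartile rule in which the median is included in both halves when n is odd. That rule is an assumption, but it does reproduce the Q1 = 2.3 and Q3 = 4.2 shown on the slide. Other textbooks and software use slightly different rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: when n is odd, the median is included in both halves
    before the quartiles are taken. Other quartile rules exist and can
    give slightly different Q1 and Q3.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2           # size of each half (median included if n is odd)
    lower = data[:half]           # smallest `half` values
    upper = data[n - half:]       # largest `half` values
    return data[0], median(lower), median(data), median(upper), data[-1]

# Years until death for the 25 Disease X individuals (from the slide).
disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

summary = five_number_summary(disease_x)
print(summary)  # (0.6, 2.3, 3.4, 4.2, 6.1)
```

The IQR then follows as `summary[3] - summary[1]`.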

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Q3 = third quartile = 4.2

Q1 = first quartile = 2.3

Largest = max = 7.9

Disease X: years until death for 25 individuals

25   1   7.9
24   2   6.1
23   3   5.3
22   4   4.9
21   5   4.7
20   6   4.5
19   7   4.2
18   6   4.1
17   5   3.9
16   4   3.8
15   3   3.7
14   2   3.6
13   1   3.4
12   2   3.3
11   3   2.9
10   4   2.8
 9   5   2.5
 8   6   2.3
 7   7   2.3
 6   6   2.1
 5   5   1.5
 4   4   1.9
 3   3   1.6
 2   2   1.2
 1   1   0.6

Boxplot display of 5-number summary

[Figure: boxplot of the Disease X data; vertical axis "Years until death", 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, which is greater than 7.05, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
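The outlier rule just described can be sketched in Python as follows. The function name `boxplot_parts` and the returned dictionary layout are hypothetical, chosen only for illustration. The quartiles are passed in as given on the slide rather than recomputed.

```python
def boxplot_parts(values, q1, q3, k=1.5):
    """Apply the 1.5 x IQR rule.

    Returns the fences, the whisker ends (most extreme values that are
    still inside the fences), and the values flagged as outliers.
    """
    step = k * (q3 - q1)                      # 1.5 x IQR
    low_fence, high_fence = q1 - step, q3 + step
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Disease X data with the largest value changed to 7.9 (from the slide).
disease_x_outlier = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9,
                     3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9,
                     5.3, 6.1, 7.9]

parts = boxplot_parts(disease_x_outlier, q1=2.3, q3=4.2)
print(round(parts["fences"][1], 2))  # upper fence: 7.05
print(parts["whiskers"])             # (0.6, 6.1)
print(parts["outliers"])             # [7.9]
```

The upper whisker stops at 6.1, the largest value below the fence at 7.05, and 7.9 is plotted separately as an outlier.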

ATM Withdrawals by Day Month Holidays

Beg. of class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of the pulse rates, marked at 40.5, 45, 63, 70, 78 and 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: box plot, "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit, e.g. the height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. the height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

marg. dist. of survival

Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

marg. dist. of class

Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
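As a rough illustration, the marginal distributions can be computed directly from the cell counts of the Titanic table. The nested-dictionary layout and the function name `marginals` are assumptions made for this sketch. Only the counts come from the slide.

```python
# Cell counts from the Titanic table: survival (rows) by class (columns).
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

def marginals(table):
    """Return (row_marginals, column_marginals) as proportions of the grand total."""
    grand_total = sum(sum(cells.values()) for cells in table.values())
    row_marg = {row: sum(cells.values()) / grand_total
                for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    col_marg = {col: total / grand_total for col, total in col_totals.items()}
    return row_marg, col_marg

survival_marg, class_marg = marginals(titanic)
for name, p in {**survival_marg, **class_marg}.items():
    print(f"{name}: {p:.1%}")
```

Each marginal distribution uses only the totals in the table's margins, which is where the name comes from.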

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival             Crew    First   Second    Third    Total
Alive   Count         212      202      118      178      710
        % of col    24.0%    62.2%    41.4%    25.2%    32.3%
Dead    Count         673      123      167      528     1491
        % of col    76.0%    37.8%    58.6%    74.8%    67.7%
Total   Count         885      325      285      706     2201
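The "% of col" rows are conditional distributions of survival given class: each count is divided by its column total instead of the grand total. A minimal sketch follows, with a hypothetical function name `conditional_given_column` and the same assumed nested-dictionary layout for the counts.

```python
# Cell counts from the Titanic table: survival (rows) by class (columns).
titanic = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}

def conditional_given_column(table):
    """For each column, return the distribution of the row variable within it."""
    columns = list(next(iter(table.values())))
    result = {}
    for col in columns:
        col_total = sum(table[row][col] for row in table)
        result[col] = {row: table[row][col] / col_total for row in table}
    return result

survival_given_class = conditional_given_column(titanic)
for cls, dist in survival_given_class.items():
    print(cls, {row: f"{p:.1%}" for row, p in dist.items()})
```

Comparing the columns shows how survival differs by class, for example about 62% of first-class passengers survived against about 25% of third-class passengers.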

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                              Class
Survival             Crew    First   Second    Third    Total
Alive   Count         212      202      118      178      710
        % of col    24.0%    62.2%    41.4%    25.2%    32.3%
Dead    Count         673      123      167      528     1491
        % of col    76.0%    37.8%    58.6%    74.8%    67.7%
Total   Count         885      325      285      706     2201
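All three answers share the numerator 202 (first-class survivors). Only the denominator changes with the group being described. Worked out as percentages:

$$\frac{202}{710} \approx 28.5\% \quad \text{(of the 710 survivors, the share in first class)}$$

$$\frac{202}{2201} \approx 9.2\% \quad \text{(of all 2201 people aboard, the share both in first class and alive)}$$

$$\frac{202}{325} \approx 62.2\% \quad \text{(of the 325 first-class passengers, the share who survived)}$$

The last value matches the "% of col" entry for first class in the table.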

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
 1        5       0.1
 2        2       0.03
 3        9       0.19
 4        7       0.095
 5        3       0.07
 6        3       0.02
 7        4       0.07
 8        5       0.085
 9        8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

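Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot, "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$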
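The formula can be checked against the fuel consumption data with a short Python sketch. The function name `correlation` is hypothetical. It follows the formula term by term: it standardizes each x and y with the sample standard deviation (n - 1 in the denominator), multiplies the pairs, and divides the sum by n - 1.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r = 1/(n-1) * sum of products of standardized values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

Because each value is standardized first, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.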

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 110: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Below are the weights of 31 linemen on the NCSU football team The first quartile Q1 is 2635 What is the value of the IQR

stemleaf

2 2255

4 2357

6 2426

7 257

10 26257

12 2759

(4) 281567

15 2935599

10 30333

7 3145

5 32155

2 336

1 340

1 235

2 395

3 46

4 695

5-number summary of data

Minimum Q1 median Q3 maximum

Example Pulse data

45 63 70 78 111

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3= third quartile = 42

Q1= first quartile = 23

25 1 7924 2 6123 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 79

Boxplot display of 5-number summary

BOXPLOT

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

8

Interquartile range

Q3 ndash Q1=42 minus 23 =

19

Q3+15IQR=42+285 = 705

15 IQR = 1519=285 Individual 25 has a value of

79 years so 79 is an outlier The line from the top

end of the box is drawn to the biggest number in the

data that is less than 705

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138) Q1 = 63 Q3 = 78 IQR=78 63=15

15(IQR)=15(15)=225

Q1 - 15(IQR) 63 ndash 225=405

Q3 + 15(IQR) 78 + 225=1005

7063 78405 100545

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who

gained at least 50 yards What is the approximate value of Q3

0 136273

410547

684821

9581095

12321369

Pass Catching Yards by Receivers

1 450

2 750

3 215

4 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 80
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs); y-axis: FUEL CONSUMP (gal/100 miles)]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

For bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

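Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs); y-axis: FUEL CONSUMP (gal/100 miles)]

r = .9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$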
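As a check on the formula, the sketch below computes r for the ten (weight, fuel consumption) pairs from the scatterplot slide. It assumes the pairs read as (3.4, 5.5), (3.8, 5.9), and so on, with weight in 1000 lbs and fuel consumption in gal/100 miles; the helper names are illustrative, not from the slides.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Assumed reading of the slide data: weight in 1000 lbs, fuel in gal/100 miles.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Under that reading of the data, the result should agree with the slide's r = .9766 to four decimal places.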

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.
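A small illustration of these properties, using made-up data rather than anything from the slides: points exactly on a rising line give r = +1, points exactly on a falling line give r = -1, and scattered points give a value in between. The function uses the algebraically equivalent sums-of-products form of r.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data chosen only to illustrate the range of r.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # every point on a line with positive slope
falling = [10 - 3 * x for x in xs]  # every point on a line with negative slope
scattered = [2, 1, 4, 3, 5]         # upward trend, but not a perfect line

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
r_scattered = correlation(xs, scattered)

print(r_rising, r_falling, r_scattered)
```

The sign of r carries the direction, and the distance of r from 0 carries the strength.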

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.
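The reported correlation can be checked numerically. The sketch below assumes the eleven data pairs read as (fouls, points) = (1, 2), (24, 75), (1, 0), and so on; the variable names are illustrative. It uses the shortcut form of r built from sums of squares and cross-products.

```python
from math import sqrt

# Assumed reading of the slide data: (fouls committed, points scored) per player.
players = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7),
           (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)]

n = len(players)
sum_x = sum(x for x, _ in players)
sum_y = sum(y for _, y in players)

# Corrected sums of squares and cross-products.
s_xx = sum(x * x for x, _ in players) - sum_x ** 2 / n
s_yy = sum(y * y for _, y in players) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in players) - sum_x * sum_y / n

r_fouls_points = s_xy / sqrt(s_xx * s_yy)
print(f"r = {r_fouls_points:.3f}")
```

A value this close to +1 says only that fouls and points rise together across players; it gives no reason to think that committing fouls produces points.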

End of Chapter 3

  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3

5-number summary of data

Minimum, Q1, median, Q3, maximum

Example: Pulse data

45, 63, 70, 78, 111

Example: Disease X, years until death (n = 25)

Obs.   Count   Years until death
25     1       6.1
24     2       5.6
23     3       5.3
22     4       4.9
21     5       4.7
20     6       4.5
19     7       4.2
18     6       4.1
17     5       3.9
16     4       3.8
15     3       3.7
14     2       3.6
13     1       3.4
12     2       3.3
11     3       2.9
10     4       2.8
9      5       2.5
8      6       2.3
7      7       2.3
6      6       2.1
5      5       1.5
4      4       1.9
3      3       1.6
2      2       1.2
1      1       0.6

Largest = max = 6.1
Q3 = third quartile = 4.2
m = median = 3.4
Q1 = first quartile = 2.3
Smallest = min = 0.6

Five-number summary: min, Q1, m, Q3, max

Boxplot display of 5-number summary

[Figure: BOXPLOT of Disease X; axis: Years until death, 0 to 7]
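The five-number summary for the Disease X data can be computed as sketched below. The quartile rule is an assumption: when n is odd, the median is counted in both halves, which is the choice that reproduces the slide's Q1 = 2.3 and Q3 = 4.2 (other common rules give slightly different quartiles). The data values assume the decimals read as 0.6, 1.2, and so on.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed rule: for odd n, the median belongs to both the lower and upper half.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half, including the median when n is odd
    lower, upper = data[:half], data[n - half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


# Years until death for the 25 individuals (assumed reading of the slide values).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

summary = five_number_summary(years)
print(summary)
```

Because quartile conventions differ between textbooks and software packages, a package may report slightly different values for Q1 and Q3 on the same data.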

Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Boxplot display of 5-number summary: Disease X with an outlier

Obs.   Count   Years until death
25     1       7.9
24     2       6.1
23     3       5.3
22     4       4.9
21     5       4.7
20     6       4.5
19     7       4.2
18     6       4.1
17     5       3.9
16     4       3.8
15     3       3.7
14     2       3.6
13     1       3.4
12     2       3.3
11     3       2.9
10     4       2.8
9      5       2.5
8      6       2.3
7      7       2.3
6      6       2.1
5      5       1.5
4      4       1.9
3      3       1.6
2      2       1.2
1      1       0.6

Largest = max = 7.9
Q3 = third quartile = 4.2
Q1 = first quartile = 2.3

[Figure: BOXPLOT of Disease X; axis: Years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
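The outlier check above can be sketched in code. The quartiles are taken from the slide as given, the data values assume the decimals read as 0.6, 1.2, and so on, and the function name `boxplot_fences` is illustrative.

```python
def boxplot_fences(values, q1, q3, k=1.5):
    """Apply the 1.5 * IQR rule.

    Returns the lower fence, the upper fence, the values outside the fences,
    and the whisker ends (smallest and largest values inside the fences).
    """
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return lower_fence, upper_fence, outliers, (min(inside), max(inside))


# Years until death, with individual 25 at 7.9 (assumed reading of the slide values).
years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

lower_fence, upper_fence, outliers, whiskers = boxplot_fences(years, q1=2.3, q3=4.2)
print(f"upper fence = {upper_fence:.2f}, outliers = {outliers}, whiskers = {whiskers}")
```

The upper whisker stops at the largest observation inside the fence, not at the fence itself, so the fence value does not appear on the plot unless a data value happens to equal it.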

ATM Withdrawals by Day, Month, Holidays

Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot, "Pass Catching Yards by Receivers"; axis from 0 to 1369]

1. 450
2. 750
3. 215
4. 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Tuition: 4-yr Colleges

Page 112: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

m = median = 34

Q3= third quartile = 42

Q1= first quartile = 23

25 1 6124 2 5623 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 61

Smallest = min = 06

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

Five-number summary

min Q1 m Q3 max

Boxplot display of 5-number summary

BOXPLOT

Boxplot display of 5-number summary

Example age of 66 ldquocrushrdquo victims at rock concerts 2001-2010

5-number summary13 17 19 22 47

Q3= third quartile = 42

Q1= first quartile = 23

25 1 7924 2 6123 3 5322 4 4921 5 4720 6 4519 7 4218 6 4117 5 3916 4 3815 3 3714 2 3613 1 3412 2 3311 3 2910 4 289 5 258 6 237 7 236 6 215 5 154 4 193 3 162 2 121 1 06

Largest = max = 79

Boxplot display of 5-number summary

BOXPLOT

Disease X

0

1

2

3

4

5

6

7

Yea

rs u

nti

l dea

th

8

Interquartile range

Q3 ndash Q1=42 minus 23 =

19

Q3+15IQR=42+285 = 705

15 IQR = 1519=285 Individual 25 has a value of

79 years so 79 is an outlier The line from the top

end of the box is drawn to the biggest number in the

data that is less than 705

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138) Q1 = 63 Q3 = 78 IQR=78 63=15

15(IQR)=15(15)=225

Q1 - 15(IQR) 63 ndash 225=405

Q3 + 15(IQR) 78 + 225=1005

7063 78405 100545

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who

gained at least 50 yards What is the approximate value of Q3

0 136273

410547

684821

9581095

12321369

Pass Catching Yards by Receivers

1 450

2 750

3 215

4 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3


Boxplot display of 5-number summary

Example: age of 66 "crush" victims at rock concerts, 2001-2010

5-number summary: 13, 17, 19, 22, 47

Individual  Count  Years until death
25          1      7.9
24          2      6.1
23          3      5.3
22          4      4.9
21          5      4.7
20          6      4.5
19          7      4.2
18          6      4.1
17          5      3.9
16          4      3.8
15          3      3.7
14          2      3.6
13          1      3.4
12          2      3.3
11          3      2.9
10          4      2.8
9           5      2.5
8           6      2.3
7           7      2.3
6           6      2.1
5           5      1.5
4           4      1.9
3           3      1.6
2           2      1.2
1           1      0.6

Q1 = first quartile = 2.3

Q3 = third quartile = 4.2

Largest = max = 7.9

Boxplot display of 5-number summary

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 8]

Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9

1.5 IQR = 1.5 × 1.9 = 2.85

Q3 + 1.5 IQR = 4.2 + 2.85 = 7.05

Individual 25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
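The outlier check for the Disease X data can be sketched in a few lines of Python. This is only an illustration: the quartiles are taken as stated on the slide rather than recomputed, and the variable names are made up for this example.

```python
# Years until death for the 25 individuals in the Disease X table.
years = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
         3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6]

# Quartiles as given on the slide (assumed, not recomputed here).
q1, q3 = 2.3, 4.2

iqr = q3 - q1                      # interquartile range
upper_fence = q3 + 1.5 * iqr       # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr       # values below this are flagged as outliers

outliers = [y for y in years if y > upper_fence or y < lower_fence]

# Whiskers stop at the most extreme data values still inside the fences.
upper_whisker = max(y for y in years if y <= upper_fence)
lower_whisker = min(y for y in years if y >= lower_fence)

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print("outliers:", outliers)
print("whiskers run from", lower_whisker, "to", upper_whisker)
```

The upper whisker ends at the largest observation that is not beyond the fence, which is why it does not reach the maximum.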

ATM Withdrawals by Day, Month, Holidays

Beginning-of-class pulses (n = 138): Q1 = 63, Q3 = 78, IQR = 78 - 63 = 15

1.5(IQR) = 1.5(15) = 22.5

Q1 - 1.5(IQR): 63 - 22.5 = 40.5

Q3 + 1.5(IQR): 78 + 22.5 = 100.5

[Figure: boxplot of pulse rates with labeled values 40.5, 45, 63, 70, 78, 100.5]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot of Pass Catching Yards by Receivers; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1) 450

2) 750

3) 215

4) 545

Rock concert deaths: histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.
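A general-purpose language can also produce the numbers a boxplot is built from. The sketch below uses Python's standard `statistics` module on the Disease X data. The helper name `five_number_summary` is hypothetical. The choice of `method="inclusive"` is an assumption: it happens to reproduce the slide's quartiles for this data set, but software packages use differing quartile rules and may give slightly different answers.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    values = sorted(data)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return values[0], q1, median, q3, values[-1]

# Years until death for the 25 individuals in the Disease X example.
years = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
         3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.5, 1.9, 1.6, 1.2, 0.6]

summary = five_number_summary(years)
print("min, Q1, median, Q3, max =", summary)
```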

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First  Second  Third  Total
Alive     212     202     118    178    710
Dead      673     123     167    528   1491
Total     885     325     285    706   2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
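The marginal distributions can be reproduced from the cell counts alone. The following Python sketch is illustrative; the dictionary layout and variable names are choices made for this example, not part of the original slides.

```python
# Cell counts from the Titanic table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = list(counts["Alive"])

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {
    c: sum(counts[status][c] for status in counts) / grand_total
    for c in classes
}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {100 * share:.1f}%")
```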

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                               Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201
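The "% of col" rows are conditional distributions of survival given class: each cell count is divided by its column total instead of the grand total. A minimal Python sketch, with illustrative names:

```python
# Cell counts from the Titanic table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}
classes = list(counts["Alive"])

# Column totals: number of people in each class.
col_totals = {c: sum(counts[s][c] for s in counts) for c in classes}

# Conditional distribution of survival within each class.
conditional = {
    c: {s: counts[s][c] / col_totals[c] for s in counts} for c in classes
}

for c in classes:
    alive = 100 * conditional[c]["Alive"]
    dead = 100 * conditional[c]["Dead"]
    print(f"{c}: {alive:.1f}% alive, {dead:.1f}% dead")
```

Within each class the two percentages add to 100%, which is what distinguishes a conditional distribution from the cell percentages of the whole table.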

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                               Class
Survival               Crew   First  Second   Third   Total
Alive   Count           212     202     118     178     710
        % of col      24.0%   62.2%   41.4%   25.2%   32.3%
Dead    Count           673     123     167     528    1491
        % of col      76.0%   37.8%   58.6%   74.8%   67.7%
Total   Count           885     325     285     706    2201
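All three answers share the numerator 202 (first-class survivors) and differ only in the denominator, which identifies the group being conditioned on. Written out, with the decimal values added here as a worked check:

```latex
\begin{align*}
\text{fraction of survivors in first class} &= \frac{202}{710} \approx 0.285 \\
\text{fraction of all on board who were first class and survived} &= \frac{202}{2201} \approx 0.092 \\
\text{fraction of first class who survived} &= \frac{202}{325} \approx 0.622
\end{align*}
```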

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1) 8.0%

2) 23.5%

3) 58.2%

4) 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1) 41.8%

2) 38.8%

3) 51.2%

4) 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%

2) 48.8%

3) 26.8%

4) 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.10
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.10
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
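To make the idea concrete without a plotting package, the sketch below draws a rough text scatterplot of the beers and BAC data: beers on the horizontal axis, BAC on the vertical axis, one `*` per student (students who land in the same cell overlap). The function `text_scatter` and its grid size are hypothetical choices for illustration; real software such as StatCrunch produces a proper graph.

```python
# Beers and BAC for the 16 students, in student order.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def text_scatter(xs, ys, width=40, height=12):
    """Return text rows with a '*' at the grid cell of each (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column / row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]

rows = text_scatter(beers, bac)
for line in rows:
    print("|" + line)          # vertical axis: BAC
print("+" + "-" * 40)          # horizontal axis: number of beers
```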

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

(x_i, y_i): (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

The correlation coefficient r

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

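Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis: WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$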
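The formula for r can be applied directly to the ten (weight, fuel consumption) pairs. The Python sketch below follows the formula term by term: standardize each x and y with the sample mean and sample standard deviation, multiply the pairs, sum, and divide by n - 1. The function names are illustrative.

```python
import math

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) for 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Because each value is standardized before multiplying, r has no units and does not change if weight is recorded in pounds instead of thousands of pounds.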

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 115: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

ATM Withdrawals by Day Month Holidays

Beg of class pulses (n=138) Q1 = 63 Q3 = 78 IQR=78 63=15

15(IQR)=15(15)=225

Q1 - 15(IQR) 63 ndash 225=405

Q3 + 15(IQR) 78 + 225=1005

7063 78405 100545

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who

gained at least 50 yards What is the approximate value of Q3

0 136273

410547

684821

9581095

12321369

Pass Catching Yards by Receivers

1 450

2 750

3 215

4 545

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, the height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., the height and GPA of each student in a sample. (Caution: data from 2 separate samples are not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First   Second   Third   Total
Alive     212     202      118     178     710
Dead      673     123      167     528    1491
Total     885     325      285     706    2201

Marginal distributions

Marginal distribution of survival:

Alive: 710/2201 = 32.3%

Dead: 1491/2201 = 67.7%

Marginal distribution of class:

Crew: 885/2201 = 40.2%

First: 325/2201 = 14.8%

Second: 285/2201 = 12.9%

Third: 706/2201 = 32.1%
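As a rough sketch of how the marginal percentages come out of the table, the Python below divides each row total and each column total by the grand total. The dictionary layout and the function name `marginal_distributions` are illustrative choices made here, not part of the slides.

```python
# Titanic counts from the contingency table (rows: survival, columns: class).
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead": {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def marginal_distributions(table):
    """Return (row_percentages, column_percentages) relative to the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    # Row marginals: each row total divided by the grand total.
    row_pct = {r: 100 * sum(cols.values()) / grand_total for r, cols in table.items()}
    # Column marginals: each column total divided by the grand total.
    col_names = list(next(iter(table.values())).keys())
    col_pct = {
        c: 100 * sum(table[r][c] for r in table) / grand_total for c in col_names
    }
    return row_pct, col_pct


survival_pct, class_pct = marginal_distributions(counts)
for name, pct in {**survival_pct, **class_pct}.items():
    print(f"{name}: {pct:.1f}%")
```

Each marginal distribution sums to 100%, because every one of the 2201 people falls into exactly one survival category and exactly one class category.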

Marginal distribution of class: bar chart

Marginal distribution of class: pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

                              Class
Survival               Crew   First   Second   Third   Total
Alive    Count          212     202      118     178     710
         % of col      24.0    62.2     41.4    25.2    32.3
Dead     Count          673     123      167     528    1491
         % of col      76.0    37.8     58.6    74.8    67.7
Total    Count          885     325      285     706    2201

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

                              Class
Survival               Crew   First   Second   Third   Total
Alive    Count          212     202      118     178     710
         % of col      24.0    62.2     41.4    25.2    32.3
Dead     Count          673     123      167     528    1491
         % of col      76.0    37.8     58.6    74.8    67.7
Total    Count          885     325      285     706    2201

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
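The three questions share the numerator 202 and differ only in the denominator, which is what separates a conditional fraction from a joint one. A small Python sketch, with variable and function names chosen here purely for illustration, makes the three denominators explicit and also recomputes the "% of col" survival rates:

```python
# Counts from the Titanic table (columns: class, rows: survival).
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

total_alive = sum(alive.values())              # all survivors
total_first = alive["First"] + dead["First"]   # all first-class passengers
grand_total = total_alive + sum(dead.values()) # everyone on board

# Conditional on surviving: share of survivors who were in first class.
frac_survivors_in_first = alive["First"] / total_alive
# Joint: share of everyone who was both first class and a survivor.
frac_first_and_survived = alive["First"] / grand_total
# Conditional on class: share of first-class passengers who survived.
frac_first_who_survived = alive["First"] / total_first


def survival_rate_by_class(alive_counts, dead_counts):
    """Percentage surviving within each class (the '% of col' row for Alive)."""
    return {
        c: 100 * alive_counts[c] / (alive_counts[c] + dead_counts[c])
        for c in alive_counts
    }


print(f"{frac_survivors_in_first:.3f} {frac_first_and_survived:.3f} "
      f"{frac_first_who_survived:.3f}")
print({c: round(p, 1) for c, p in survival_rate_by_class(alive, dead).items()})
```

Reading the question for the phrase that fixes the reference group ("of survivors", "of passengers", "of the first class passengers") tells you which total belongs in the denominator.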

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%

2. 23.5%

3. 58.2%

4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%

2. 38.8%

3. 51.2%

4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%

2. 48.8%

3. 26.8%

4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

  • Contingency Tables for Bivariate Categorical Data (previous slides)

  • Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC).

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight, y = fuel consumption

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

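Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$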
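To see the formula at work, here is a minimal Python sketch that applies it term by term to the ten (weight, fuel consumption) pairs listed for the scatterplot. The function name `correlation` and the list names are choices made for this illustration. The sample standard deviations use the n - 1 divisor, matching the 1/(n - 1) factor in the formula.

```python
import math


def correlation(xs, ys):
    """Pearson correlation r: average product of z-scores with an n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations s_x and s_y (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized values, divided by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

The result should land very close to the r = 0.9766 reported on the slide. Swapping the roles of x and y leaves r unchanged, since the formula treats the two variables symmetrically.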

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-year-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will having no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Scatterplot: Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = 0.935

End of Chapter 3

  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 118: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Rock concert deaths histogram and boxplot

Automating Boxplot Construction

Excel "out of the box" does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots

Tuition 4-yr Colleges

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example, the height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample (caution: data from 2 separate samples is not bivariate data).

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew  First  Second  Third  Total
Alive     212    202     118    178    710
Dead      673    123     167    528   1491
Total     885    325     285    706   2201

Marginal distributions

Marginal distribution of survival:
  Alive    710/2201 = 32.3%
  Dead    1491/2201 = 67.7%

Marginal distribution of class:
  Crew     885/2201 = 40.2%
  First    325/2201 = 14.8%
  Second   285/2201 = 12.9%
  Third    706/2201 = 32.1%
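The marginal percentages above can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original slides; the dictionary layout and the function name `marginal_distributions` are illustrative assumptions. It divides each row total and each column total by the grand total of 2201.

```python
# Counts from the Titanic table: rows are survival status, columns are class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def marginal_distributions(table):
    """Return (row marginals, column marginals) as proportions of the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    # Row marginal: row total divided by grand total.
    row_marg = {r: sum(cells.values()) / grand_total for r, cells in table.items()}
    # Column marginal: column total divided by grand total.
    columns = list(next(iter(table.values())).keys())
    col_marg = {c: sum(table[r][c] for r in table) / grand_total for c in columns}
    return row_marg, col_marg


survival_marg, class_marg = marginal_distributions(counts)
for name, p in {**survival_marg, **class_marg}.items():
    print(f"{name:7s} {100 * p:.1f}%")
```

Each marginal distribution uses the whole table as its denominator, which is why the survival percentages and the class percentages each add to 100%.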

Marginal distribution of class: Bar chart

Marginal distribution of class: Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival             Crew   First  Second   Third   Total
Alive  Count          212     202     118     178     710
       % of col     24.0%   62.2%   41.4%   25.2%   32.3%
Dead   Count          673     123     167     528    1491
       % of col     76.0%   37.8%   58.6%   74.8%   67.7%
Total  Count          885     325     285     706    2201
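The "% of col" rows are conditional distributions of survival given class: each count is divided by its own column total rather than by the grand total. A minimal Python sketch of that calculation follows; the function name `column_percents` is a hypothetical helper, not something from the slides.

```python
# Counts from the Titanic table: rows are survival status, columns are class.
counts = {
    "Alive": {"Crew": 212, "First": 202, "Second": 118, "Third": 178},
    "Dead":  {"Crew": 673, "First": 123, "Second": 167, "Third": 528},
}


def column_percents(table):
    """For each column (class), give the percent of that column in each row."""
    columns = list(next(iter(table.values())).keys())
    result = {}
    for c in columns:
        col_total = sum(table[r][c] for r in table)  # e.g. 325 for First
        result[c] = {r: 100 * table[r][c] / col_total for r in table}
    return result


cond = column_percents(counts)
for cls, dist in cond.items():
    print(cls, {r: round(p, 1) for r, p in dist.items()})
```

Comparing the "Alive" percentage across classes (rather than the raw counts) is what shows how survival depended on class.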

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325

                          Class
Survival             Crew   First  Second   Third   Total
Alive  Count          212     202     118     178     710
       % of col     24.0%   62.2%   41.4%   25.2%   32.3%
Dead   Count          673     123     167     528    1491
       % of col     76.0%   37.8%   58.6%   74.8%   67.7%
Total  Count          885     325     285     706    2201
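All three questions use the same numerator (the 202 first class survivors); only the denominator changes. The short Python sketch below, with variable names chosen here for illustration, makes that explicit.

```python
# Cell and totals taken from the Titanic table.
alive_first = 202   # first class passengers who survived
total_alive = 710   # all survivors
total_first = 325   # all first class passengers
grand_total = 2201  # everyone on board

# Conditional on surviving: fraction of survivors who were in first class.
first_given_alive = alive_first / total_alive
# Joint: fraction of everyone who was both in first class and a survivor.
first_and_alive = alive_first / grand_total
# Conditional on class: fraction of first class passengers who survived.
alive_given_first = alive_first / total_first

print(f"first class among survivors: {first_given_alive:.3f}")
print(f"first class and survived:    {first_and_alive:.3f}")
print(f"survived among first class:  {alive_given_first:.3f}")
```

Identifying the group named after "of" in the question (survivors, passengers, first class passengers) gives the denominator.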

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
   1       5       0.10
   2       2       0.03
   3       9       0.19
   4       7       0.095
   5       3       0.07
   6       3       0.02
   7       4       0.07
   8       5       0.085
   9       8       0.12
  10       3       0.04
  11       5       0.06
  12       5       0.05
  13       6       0.10
  14       7       0.09
  15       1       0.01
  16       4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

(x_i, y_i): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2.0, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
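To see the formula at work, here is a minimal Python sketch that applies it to the fuel consumption vs car weight data. It assumes the ten pairs read as (3.4, 5.5), (3.8, 5.9), and so on, with weight in 1000 lbs and fuel consumption in gal/100 miles. The function names `sample_sd` and `correlation` are illustrative, not from the slides.

```python
from math import sqrt

# Fuel consumption vs car weight data (x = weight, y = fuel consumption).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def sample_sd(values):
    """Sample standard deviation s (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")
```

Under these assumptions the result should agree with the value the slides report for this data set, r = .9766. Because r is built from z-scores, it has no units and does not change if weight is recorded in pounds instead of thousands of pounds.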

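Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), 1.5 to 4.5; vertical axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$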

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) vs Fouls (horizontal axis, 0 to 30)]

The correlation is due to a third "lurking" variable: playing time

correlation r = 0.935

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 119: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Automating Boxplot Construction

Excel ldquoout of the boxrdquo does not draw boxplots

Many add-ins are available on the internet that give Excel the capability to draw box plots

Statcrunch (httpstatcrunchstatncsuedu) draws box plots

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties Cause and Effect

r measures the strength of the linear relationship between x and y it does not indicate cause and effect

x = fouls committed by player

y = points scored by same player

(x y) = (fouls points)

01020304050607080

0 5 10 15 20 25 30

Fouls

Po

ints

(12) (2475) (10) (1859) (99) (37) (535) (2046) (10) (32) (2257)

The correlation is due to a third ldquolurkingrdquo variable ndash playing time

correlation r = 935

End of Chapter 3

>
  • Chapter 3 Descriptive Statistics Graphical and Numerical Summa
  • Section 31 Displaying Categorical Data
  • The three rules of data analysis wonrsquot be difficult to remember
  • Bar Charts show counts or relative frequency for each category
  • Pie Charts shows proportions of the whole in each category
  • Example Top 10 causes of death in the United States
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Internships
  • Trend Student Debt by State (grads of public 4 yr or more)
  • Slide 14
  • Slide 15
  • Unnecessary dimension in a pie chart
  • Section 31 continued Displaying Quantitative Data
  • Frequency Histograms
  • Relative Frequency Histogram of Exam Grades
  • Histograms
  • Histograms Showing Different Centers
  • Histograms - Same Center Different Spread
  • Histograms Shape
  • Shape (cont)Female heart attack patients in New York state
  • Shape (cont) outliers All 200 m Races 202 secs or less
  • Shape (cont) Outliers
  • Excel Example 2012-13 NFL Salaries
  • Statcrunch Example 2012-13 NFL Salaries
  • Heights of Students in Recent Stats Class (Bimodal)
  • Example Grades on a statistics exam
  • Example-2 Frequency Distribution of Grades
  • Example-3 Relative Frequency Distribution of Grades
  • Relative Frequency Histogram of Grades
  • Based on the histo-gram about what percent of the values are b
  • Stem and leaf displays
  • Example employee ages at a small company
  • Suppose a 95 yr old is hired
  • Number of TD passes by NFL teams 2012-2013 season (stems are 1
  • Pulse Rates n = 138
  • AdvantagesDisadvantages of Stem-and-Leaf Displays
  • Population of 185 US cities with between 100000 and 500000
  • Back-to-back stem-and-leaf displays TD passes by NFL teams 19
  • Below is a stem-and-leaf display for the pulse rates of 24 wome
  • Other Graphical Methods for Data
  • Unemployment Rate by Educational Attainment
  • Water Use During Super Bowl XLV (Packers 31 Steelers 25)
  • Heat Maps
  • Word Wall (customer feedback)
  • Section 32 Describing the Center of Data
  • 2 characteristics of a data set to measure
  • Notation for Data Values and Sample Mean
  • Simple Example of Sample Mean
  • Population Mean
  • Connection Between Mean and Histogram
  • The median another measure of center
  • Student Pulse Rates (n=62)
  • The median splits the histogram into 2 halves of equal area
  • Mean balance point Median 50 area each half mean 5526 year
  • Medians are used often
  • Examples
  • Below are the annual tuition charges at 7 public universities
  • Below are the annual tuition charges at 7 public universities (2)
  • Properties of Mean Median
  • Example class pulse rates
  • 2010 2014 baseball salaries
  • Disadvantage of the mean
  • Mean Median Maximum Baseball Salaries 1985 - 2014
  • Skewness comparing the mean and median
  • Skewed to the left negatively skewed
  • Symmetric data
  • Section 33 Describing Variability of Data
  • Recall 2 characteristics of a data set to measure
  • Ways to measure variability
  • Example
  • The Sample Standard Deviation a measure of spread around the m
  • Calculations hellip
  • Slide 77
  • Population Standard Deviation
  • Remarks
  • Remarks (cont)
  • Remarks (cont) (2)
  • Review Properties of s and s
  • Summary of Notation
  • Section 33 (cont) Using the Mean and Standard Deviation Toget
  • 68-95-997 rule
  • The 68-95-997 rule If the histogram of the data is approximat
  • 68-95-997 rule 68 within 1 stan dev of the mean
  • 68-95-997 rule 95 within 2 stan dev of the mean
  • Example textbook costs
  • Example textbook costs (cont)
  • Example textbook costs (cont) (2)
  • Example textbook costs (cont) (3)
  • The best estimate of the standard deviation of the menrsquos weight
  • Section 33 (cont) Using the Mean and Standard Deviation Toget (2)
  • Z-scores Standardized Data Values
  • z-score corresponding to y
  • Slide 97
  • Comparing SAT and ACT Scores
  • Z-scores add to zero
  • Recently the mean tuition at 4-yr public collegesuniversities
  • Section 34 Measures of Position (also called Measures of Relat
  • Slide 102
  • Quartiles and median divide data into 4 pieces
  • Quartiles are common measures of spread
  • Rules for Calculating Quartiles
  • Example (2)
  • Pulse Rates n = 138 (2)
  • Below are the weights of 31 linemen on the NCSU football team
  • Interquartile range another measure of spread
  • Example beginning pulse rates
  • Below are the weights of 31 linemen on the NCSU football team (2)
  • 5-number summary of data
  • Slide 113
  • Boxplot display of 5-number summary
  • Slide 115
  • ATM Withdrawals by Day Month Holidays
  • Slide 117
  • Beg of class pulses (n=138)
  • Below is a box plot of the yards gained in a recent season by t
  • Rock concert deaths histogram and boxplot
  • Automating Boxplot Construction
  • Tuition 4-yr Colleges
  • Section 35 Bivariate Descriptive Statistics
  • Basic Terminology
  • Contingency Tables for Bivariate Categorical Data
  • Marginal distribution of class Bar chart
  • Marginal distribution of class Pie chart
  • Contingency Tables for Bivariate Categorical Data - 2
  • Conditional distributions segmented bar chart
  • Contingency Tables for Bivariate Categorical Data - 3
  • TV viewers during the Super Bowl in 2013 What is the marginal
  • TV viewers during the Super Bowl in 2013 What percentage watch
  • TV viewers during the Super Bowl in 2013 Given that a viewer d
  • Section 35 Bivariate Descriptive Statistics (2)
  • Slide 135
  • Scatterplot Blood Alcohol Content vs Number of Beers
  • Scatterplot Fuel Consumption vs Car Weight x=car weight y=f
  • The correlation coefficient r
  • Correlation Fuel Consumption vs Car Weight
  • Properties r ranges from -1 to+1
  • Properties (cont) High correlation does not imply cause and ef
  • Properties Cause and Effect
  • Properties Cause and Effect
  • End of Chapter 3
Page 120: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Tuition 4-yr Colleges

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology Univariate data 1 variable is measured

on each sample unit or population unit For example height of each student in a sample

Bivariate data 2 variables are measured on each sample unit or population uniteg height and GPA of each student in a sample (caution data from 2 separate samples is not bivariate data)

Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical

Data - 3Questions What fraction of survivors were in first class What fraction of passengers were in first class and

survivors What fraction of the first class passengers

survived ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

202710

2022201

202325

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), y-axis FUEL CONSUMP (gal/100 miles)]

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) = 0.9766$$
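To make the formula concrete, here is a minimal sketch that applies it to the ten (weight, fuel consumption) pairs from the scatterplot slide. The function name `correlation` is chosen for this example, and the data values assume the decimal points implied by the axis scales (weights in 1000 lbs, fuel in gal/100 miles).

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation: r = (1/(n-1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations use the n - 1 divisor.
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # y, gal/100 miles

r = correlation(weight, fuel)
print(round(r, 4))
```

Under these assumptions the result should agree with the value of r reported on the slide.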

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.
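These properties can be checked on small made-up data sets. The sketch below uses the algebraically equivalent form $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$; the function name `r_value` and all three y-lists are hypothetical and are not taken from the slides.

```python
from math import sqrt

def r_value(xs, ys):
    """Correlation via sums of squares: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]     # points exactly on a line with positive slope
falling = [10 - 3 * v for v in x]   # points exactly on a line with negative slope
scattered = [2, 9, 4, 8, 5]         # hypothetical values, only loosely increasing

r_rising = r_value(x, rising)
r_falling = r_value(x, falling)
r_scattered = r_value(x, scattered)
print(r_rising, r_falling, r_scattered)
```

Points that sit exactly on a line give r at one of the two extremes, with the sign matching the slope, while the loosely increasing values give a positive r well below 1.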

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player
y = points scored by same player
(x, y) = (fouls, points)

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

[Figure: scatterplot of Points (axis 0 to 80) vs Fouls (axis 0 to 30)]

correlation r = .935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3


Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Basic Terminology

Univariate data: 1 variable is measured on each sample unit or population unit. For example: height of each student in a sample.

Bivariate data: 2 variables are measured on each sample unit or population unit, e.g., height and GPA of each student in a sample. (Caution: data from 2 separate samples is not bivariate data.)

Contingency Tables for Bivariate Categorical Data

Example: Survival and class on the Titanic

         Crew   First  Second  Third  Total
Alive     212     202     118    178    710
Dead      673     123     167    528   1491
Total     885     325     285    706   2201

Marginal distributions

Marginal distribution of survival:
Alive: 710/2201 = 32.3%
Dead: 1491/2201 = 67.7%

Marginal distribution of class:
Crew: 885/2201 = 40.2%
First: 325/2201 = 14.8%
Second: 285/2201 = 12.9%
Third: 706/2201 = 32.1%
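The marginal distributions are just the row totals and column totals divided by the grand total. A minimal sketch of that arithmetic, using the Titanic counts from the table (the variable names are chosen for this example):

```python
# Titanic counts from the contingency table (rows: survival, columns: class).
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {status: sum(row) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of class: column totals divided by the grand total.
class_marginal = {cls: sum(row[j] for row in counts.values()) / grand_total
                  for j, cls in enumerate(classes)}

for name, share in {**survival_marginal, **class_marginal}.items():
    print(f"{name}: {share:.1%}")
```

Each marginal distribution sums to 100%, since every person falls into exactly one survival category and exactly one class.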

[Figure: Marginal distribution of class, bar chart]

[Figure: Marginal distribution of class, pie chart]

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: Given the class of a passenger, what is the chance the passenger survived?

                       Class
Survival               Crew   First  Second  Third  Total
Alive   Count           212     202     118    178    710
        % of col       24.0    62.2    41.4   25.2   32.3
Dead    Count           673     123     167    528   1491
        % of col       76.0    37.8    58.6   74.8   67.7
Total   Count           885     325     285    706   2201
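The "% of col" rows come from dividing each count by its column total, which conditions on class. A short sketch of that calculation, with names chosen for this example:

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

# Condition on class: divide each cell by its column total.
conditional = {}
for cls, a, d in zip(classes, alive, dead):
    col_total = a + d
    conditional[cls] = {"Alive": a / col_total, "Dead": d / col_total}

for cls, dist in conditional.items():
    print(f"{cls:<7} alive {dist['Alive']:.1%}  dead {dist['Dead']:.1%}")
```

Comparing these columns side by side is what the segmented bar chart does graphically: the survival rate differs sharply by class, so survival and class are not independent in this table.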

[Figure: Conditional distributions, segmented bar chart]

Contingency Tables for Bivariate Categorical Data - 3

Questions (using the Titanic survival-by-class table):

What fraction of survivors were in first class? 202/710

What fraction of passengers were in first class and survivors? 202/2201

What fraction of the first class passengers survived? 202/325
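All three questions share the numerator 202 and differ only in the denominator, that is, in which group is being conditioned on. A small sketch of the three calculations (variable names are chosen for this example):

```python
alive_first = 202    # first-class passengers who survived
alive_total = 710    # all survivors
first_total = 325    # all first-class passengers
grand_total = 2201   # everyone in the table

# Fraction of survivors who were in first class (conditional on survival).
frac_of_survivors_in_first = alive_first / alive_total

# Fraction of all passengers who were both first class and survivors (joint).
frac_first_and_survived = alive_first / grand_total

# Fraction of first-class passengers who survived (conditional on class).
frac_of_first_who_survived = alive_first / first_total

print(f"{frac_of_survivors_in_first:.3f}")
print(f"{frac_first_and_survived:.3f}")
print(f"{frac_of_first_who_survived:.3f}")
```

The same count gives three quite different answers, which is why the wording of the condition matters when reading a contingency table.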

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%




Contingency Tables for Bivariate Categorical Data

Example Survival and class on the Titanic

Crew First Second Third TotalAlive 212 202 118 178 710Dead 673 123 167 528 1491Total 885 325 285 706 2201

Marginal distributions marg dist of survival

7102201 323

14912201 677

marg dist of class

8852201 402

3252201 148

2852201 129

7062201 321

Marginal distribution of classBar chart

Marginal distribution of class Pie chart

Contingency Tables for Bivariate Categorical Data - 2

Conditional distributionsGiven the class of a passenger what is the chance the passenger survived

ClassCrew First Second Third Total

Alive Count 212 202 118 178 710Survival of col 240 622 414 252 323

Dead Count 673 123 167 528 1491 of col 760 378 586 748 677

Total Count 885 325 285 706 2201

Conditional distributions segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions (answered from the Titanic table of counts):

1. What fraction of survivors were in first class? 202/710 (about 28.5%)
2. What fraction of passengers were in first class and survivors? 202/2201 (about 9.2%)
3. What fraction of the first class passengers survived? 202/325 (about 62.2%)
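The three questions share the same numerator (202 first class survivors) and differ only in the denominator, which is what separates a conditional fraction from a joint one. The following is a minimal sketch of that comparison; the variable names are hypothetical and only the counts come from the table.

```python
from fractions import Fraction

# Counts taken from the Titanic table.
alive_first = 202    # first class passengers who survived
alive_total = 710    # all survivors
first_total = 325    # all first class passengers
grand_total = 2201   # everyone in the table

# Conditional on surviving: share of survivors who were in first class.
p_first_given_alive = Fraction(alive_first, alive_total)

# Joint: share of everyone who was both in first class and a survivor.
p_first_and_alive = Fraction(alive_first, grand_total)

# Conditional on class: share of first class passengers who survived.
p_alive_given_first = Fraction(alive_first, first_total)

for label, p in [
    ("first class, given survived", p_first_given_alive),
    ("first class and survived", p_first_and_alive),
    ("survived, given first class", p_alive_given_first),
]:
    print(f"{label}: {float(p):.3f}")
```

Reading the question carefully to identify the group being described (survivors, all passengers, or first class passengers) determines which total goes in the denominator.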

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

Section 3.5: Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

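Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis: WEIGHT (1000 lbs), 1.5 to 4.5; y-axis: FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$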
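As a check on the value of r for the car data, the formula can be applied directly: standardize each x and y with its sample mean and sample standard deviation, multiply the paired z-scores, and divide the sum by n - 1. This is a minimal sketch using only the Python standard library; the helper names `sample_sd` and `correlation` are hypothetical, and the data pairs are the ten (weight, fuel consumption) points listed on the scatterplot slide.

```python
from math import sqrt

# (weight in 1000 lbs, fuel consumption in gal/100 miles) for the 10 cars.
data = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]


def sample_sd(values, mean):
    """Sample standard deviation (divides by n - 1)."""
    return sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def correlation(pairs):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs, x_bar), sample_sd(ys, y_bar)
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs)
    return sum(z_products) / (n - 1)


r = correlation(data)
print(round(r, 4))  # 0.9766, agreeing with the value on the slide
```

Because each term is a product of z-scores, r has no units: measuring weight in kilograms or fuel consumption in liters would give the same value.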

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of points (0 to 80) vs fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3




Contingency Tables for Bivariate Categorical Data - 2

Conditional distributions: given the class of a passenger, what is the chance the passenger survived?

Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
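The "% of col" rows are conditional distributions of survival given class: each count is divided by its column total. A minimal Python sketch of that arithmetic, using the counts from the table (the variable and function names are illustrative and do not come from the slides):

```python
# Counts from the Titanic table, in the table's column order.
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]


def column_percentages(alive_counts, dead_counts):
    """Return (% alive, % dead) within each column, rounded to 1 decimal."""
    result = []
    for a, d in zip(alive_counts, dead_counts):
        total = a + d  # column total for this class
        result.append((round(100 * a / total, 1), round(100 * d / total, 1)))
    return result


percentages = column_percentages(alive, dead)
for name, (pct_alive, pct_dead) in zip(classes, percentages):
    print(f"{name:<7} alive {pct_alive:5.1f}%   dead {pct_dead:5.1f}%")

# Overall survival percentage (the "Total" column), for comparison.
overall_alive = round(100 * sum(alive) / (sum(alive) + sum(dead)), 1)
print(f"Overall alive {overall_alive:5.1f}%")
```

Comparing each column with the overall figure shows how strongly survival depended on class: first class is well above the overall rate, crew and third class are below it.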

Conditional distributions: segmented bar chart

Contingency Tables for Bivariate Categorical Data - 3

Questions:

1) What fraction of survivors were in first class? 202/710
2) What fraction of passengers were in first class and survivors? 202/2201
3) What fraction of the first class passengers survived? 202/325

Survival              Crew   First   Second   Third   Total
Alive   Count          212     202      118     178     710
        % of col      24.0    62.2     41.4    25.2    32.3
Dead    Count          673     123      167     528    1491
        % of col      76.0    37.8     58.6    74.8    67.7
Total   Count          885     325      285     706    2201
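All three answers share the numerator 202 (first class and alive); only the denominator changes with the group being described. A short Python sketch of that distinction, using counts from the table (variable names are illustrative):

```python
# Counts taken from the Titanic table.
first_class_survivors = 202  # cell: first class AND alive
all_survivors = 710          # row total: alive
all_on_board = 2201          # grand total
first_class_total = 325      # column total: first class

# Conditional on survival: of the survivors, what fraction were first class?
first_given_alive = first_class_survivors / all_survivors
# Joint: of everyone on board, what fraction were first class and survived?
first_and_alive = first_class_survivors / all_on_board
# Conditional on class: of first class passengers, what fraction survived?
alive_given_first = first_class_survivors / first_class_total

print(f"202/710  = {first_given_alive:.3f}")
print(f"202/2201 = {first_and_alive:.3f}")
print(f"202/325  = {alive_given_first:.3f}")
```

The last value matches the 62.2% entry in the "% of col" row for first class.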

TV viewers during the Super Bowl in 2013: What is the marginal distribution of those who watched the commercials only?

1) 8.0%
2) 23.5%
3) 58.2%
4) 27.7%

TV viewers during the Super Bowl in 2013: What percentage watched the game and were female?

1) 41.8%
2) 38.8%
3) 51.2%
4) 19.8%

TV viewers during the Super Bowl in 2013: Given that a viewer did not watch the Super Bowl telecast, what percentage were male?

1) 45.2%
2) 48.8%
3) 26.8%
4) 27.7%

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student   Beers   Blood Alcohol (BAC)
   1        5       0.10
   2        2       0.03
   3        9       0.19
   4        7       0.095
   5        3       0.07
   6        3       0.02
   7        4       0.07
   8        5       0.085
   9        8       0.12
  10        3       0.04
  11        5       0.06
  12        5       0.05
  13        6       0.10
  14        7       0.09
  15        1       0.01
  16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
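The definition of r on this slide can be computed directly: standardize each x and each y with its sample mean and sample standard deviation, multiply the paired z-scores, add them up, and divide by n - 1. A minimal Python sketch applied to the beers and BAC values from the student table (the helper names `mean`, `sample_sd` and `correlation` are illustrative, not from the slides):

```python
from math import sqrt

# Beers and blood alcohol content for the 16 students in the table.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = 1/(n-1) * sum of products of paired z-scores."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    z_products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y))
    return sum(z_products) / (n - 1)


r = correlation(beers, bac)
print(f"r for beers vs BAC = {r:.3f}")
```

Because r is built from z-scores, it has no units, and a positive value here reflects that students who drank more beers tended to have higher BAC.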

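Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; x-axis WEIGHT (1000 lbs), 1.5 to 4.5; y-axis FUEL CONSUMP. (gal/100 miles), 2 to 7]

r = 0.9766

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$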
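The reported correlation can be checked against the ten (weight, fuel consumption) pairs listed on the scatterplot slide. The sketch below uses an algebraically equivalent form of the formula, r = S_xy / sqrt(S_xx * S_yy), where the S terms are sums of squared deviations and cross-products; the n - 1 factors cancel. This rearrangement is chosen here for brevity and is not the form shown on the slide. The data values assume the decimal points dropped in the slide text belong where the plot's axis ranges place them.

```python
from math import sqrt

# (weight in 1000 lbs, fuel consumption in gal/100 miles) for 10 cars.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]


def correlation(x, y):
    """Pearson r computed from sums of deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


r = correlation(weight, fuel)
print(f"r = {r:.4f}")  # agrees with the slide's value to four decimals
```

A value this close to +1 says the points lie very near an upward-sloping straight line: heavier cars use more fuel per 100 miles.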

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-year-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 129: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

TV viewers during the Super Bowl in 2013 What is the marginal distribution of those who watched the commercials only

1 80

2 235

3 582

4 277

TV viewers during the Super Bowl in 2013 What percentage watched the game and were female

1 418

2 388

3 512

4 198

TV viewers during the Super Bowl in 2013 Given that a viewer did not watch the Super Bowl telecast what percentage were male

1 452

2 488

3 268

4 277

Section 3.5 Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data (previous slides)

Scatterplots and Correlation for Bivariate Quantitative Data (next)

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

Scatterplot: Fuel Consumption vs Car Weight

x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

$(x_i, y_i)$: (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), vertical axis FUEL CONSUMP (gal/100 miles)]

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

Bivariate data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Correlation: Fuel Consumption vs Car Weight

[Figure: scatterplot "FUEL CONSUMPTION vs CAR WEIGHT"; horizontal axis WEIGHT (1000 lbs), vertical axis FUEL CONSUMP (gal/100 miles)]

r = 0.9766
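As a check on the value reported for the car data, here is a minimal sketch that applies the definition of r (the average product of standardized x and y values, with divisor n - 1) to the ten (weight, fuel consumption) pairs listed on the scatterplot slide. The function name `correlation` is an assumption made for this example, and the decimal points in the data are restored from the axis scales.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Car weight (1000 lbs) and fuel consumption (gal/100 miles) from the slide
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

r = correlation(weight, fuel)
print(round(r, 4))  # 0.9766
```

The result agrees with the r shown on the slide, so heavier cars in this sample tend strongly to use more fuel per 100 miles.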

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.
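These properties can be seen on small made-up data sets. The sketch below uses hypothetical numbers chosen only for illustration: points exactly on a rising line give r = +1, points exactly on a falling line give r = -1, and a perfect but curved (non-linear) relationship can give r = 0, because r only measures how closely the points follow a straight line. The helper `correlation` is an assumed name.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r using the standardized-product definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical illustration data (not from the slides)
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # exactly on a line with positive slope
falling = [10 - 3 * x for x in xs]   # exactly on a line with negative slope

r_rising = correlation(xs, rising)    # essentially +1
r_falling = correlation(xs, falling)  # essentially -1

# A perfect relationship that is not linear: y = x^2 on symmetric x values
xs_symmetric = [-2, -1, 0, 1, 2]
curve = [x * x for x in xs_symmetric]
r_curve = correlation(xs_symmetric, curve)  # essentially 0

print(r_rising, r_falling, r_curve)
```

Values may differ from +1, -1, and 0 in the last decimal places because of floating-point rounding.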

Properties (cont.): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!

Everyone who ate carrots in 1865 is now dead!

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (vertical axis, 0 to 80) against Fouls (horizontal axis, 0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.

End of Chapter 3

Page 132: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

Section 35Bivariate Descriptive Statistics

Contingency Tables for Bivariate Categorical Data

Scatterplots and Correlation for Bivariate Quantitative Data

Previous slidesNext

Student Beers Blood Alcohol

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Here we have two quantitative

variables for each of 16 students

1) How many beers

they drank and

2) Their blood alcohol

level (BAC)

We are interested in the

relationship between the

two variables How is

one affected by changes

in the other one

Scatterplots the most frequently used method to graphically describe the relationship between 2 quantitative variables

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

Student Beers BAC

1 5 01

2 2 003

3 9 019

4 7 0095

5 3 007

6 3 002

7 4 007

8 5 0085

9 8 012

10 3 004

11 5 006

12 5 005

13 6 01

14 7 009

15 1 001

16 4 005

Scatterplot Blood Alcohol Content vs Number of Beers

In a scatterplot one axis is used to represent each of the

variables and the data are plotted as points on the graph

Scatterplot Fuel Consumption vs Car

Weight x=car weight y=fuel cons (xi yi) (34 55) (38 59) (41 65) (22 33)(26 36) (29 46) (2 29) (27 36) (19 31) (34 49)

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

The correlation coefficient r is a measure of the direction and strength

of the linear relationship between 2 quantitative variables

The correlation coefficient r

Correlation can only be used to describe quantitative variables Categorical variables donrsquot have means and standard deviations

1

1

1

ni i

i x y

x x y yr

n s s

1 1 2 2bivariate data ( ) ( ) ( )n nx y x y x y

CorrelationFuel Consumption vs Car Weight

FUEL CONSUMPTION vs CAR WEIGHT

2

3

4

5

6

7

15 25 35 45

WEIGHT (1000 lbs)

FU

EL

CO

NS

UM

P

(gal

100

mile

s)

r = 9766

1

1

1

ni i

i x y

x x y yr

n s s

Propertiesr ranges from

-1 to+1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables

Strength how closely the points follow a straight line

Direction is positive when individuals with higher X values tend to have higher values of Y

Properties (cont) High correlation does not imply cause and effect

CARROTS Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920 if they are still

alive has severely wrinkled skin

Everyone who ate carrots in 1865 is now dead

45 of 50 17 yr olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest

>

Properties Cause and Effect There is a strong positive correlation between

the monetary damage caused by structural fires and the number of firemen present at the fire (More firemen-more damage)

Improper training Will no firemen present result in the least amount of damage

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points): (1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)

[Figure: scatterplot of Points (0 to 80) vs Fouls (0 to 30)]

correlation r = 0.935

The correlation is due to a third "lurking" variable: playing time.
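The fouls and points correlation can be reproduced from the eleven data pairs. The sketch below uses the computational form r = S_xy / sqrt(S_xx * S_yy), which is algebraically equivalent to the z-score formula on the earlier slides because the (n - 1) factors cancel. The variable names are assumptions for illustration, and the pairs are read from the slide with their separators restored.

```python
import math

# (fouls, points) for the 11 players on the slide.
fouls = [1, 24, 1, 18, 9, 3, 5, 20, 1, 3, 22]
points = [2, 75, 0, 59, 9, 7, 35, 46, 0, 2, 57]

n = len(fouls)
x_bar = sum(fouls) / n
y_bar = sum(points) / n

# Sums of squared deviations and of cross-products of deviations.
s_xx = sum((x - x_bar) ** 2 for x in fouls)
s_yy = sum((y - y_bar) ** 2 for y in points)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fouls, points))

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.3f}")
```

The result should agree with the slide's r = 0.935 to three decimals. The number says only that fouls and points rise together in this sample. It cannot show that committing fouls produces points, which is why the lurking variable (playing time) matters.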

End of Chapter 3


Student   Beers   Blood Alcohol
1         5       0.10
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.10
14        7       0.09
15        1       0.01
16        4       0.05

Here we have two quantitative variables for each of 16 students:

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Scatterplots: the most frequently used method to graphically describe the relationship between 2 quantitative variables

Bivariate data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)


Scatterplot: Blood Alcohol Content vs Number of Beers

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
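The idea of one axis per variable can be sketched without any plotting library by placing each (beers, BAC) pair on a character grid. This is a rough illustration under stated assumptions: `text_scatter` is a hypothetical helper, the grid size is arbitrary, and the data are the 16 students from the table with decimal points restored. Real analyses would normally use a plotting package or statistical software instead.

```python
# Number of beers (x) and blood alcohol content (y) for the 16 students.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def text_scatter(xs, ys, width=40, height=12):
    """Return rows of text in which each data point is drawn as '*'."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each variable onto its own axis: x to columns, y to rows.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]

rows = text_scatter(beers, bac)
for line in rows:
    print("|" + line)            # vertical axis: BAC
print("+" + "-" * 40)            # horizontal axis: number of beers
```

The points drift from the lower left to the upper right, which is the pattern expected when BAC tends to rise with the number of beers. Points that land on the same grid cell overlap, so the picture may show fewer than 16 stars.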

Page 136: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data UNIT OBJECTIVES At the conclusion of this unit you should be able to: n 1)Construct

The correlation coefficient r

The correlation coefficient r is a measure of the direction and strength of the linear relationship between 2 quantitative variables.

Correlation can only be used to describe quantitative variables. Categorical variables don't have means and standard deviations.

For bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
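As a rough illustration of the formula for r, the sketch below standardizes each x and y value, multiplies the standardized values pair by pair, and divides the total by n - 1. The function name `correlation` and the five data pairs are made up for this example and do not come from the slides.

```python
import math

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, chosen only to keep the arithmetic small
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```

The function raises an error for fewer than 2 points because the sample standard deviations are undefined in that case.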

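Correlation: Fuel Consumption vs Car Weight

[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT. x-axis: weight (1000 lbs); y-axis: fuel consumption (gal/100 miles); r = .9766]

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$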

Properties: r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.
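The range and sign of r can be checked numerically. The sketch below uses an algebraically equivalent form of r (sum of cross-deviations divided by the square root of the product of the sums of squared deviations). The helper name `pearson_r` and all data are hypothetical. An exact increasing line should give r = +1, an exact decreasing line r = -1, and a scattered upward trend a value strictly between 0 and 1.

```python
import math

def pearson_r(xs, ys):
    """Correlation via sum of cross-deviations over the root of the squared-deviation sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points exactly on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points exactly on a falling line
r_loose = pearson_r(xs, [3, 1, 4, 2, 6, 5])       # upward trend with scatter

print(r_up, r_down, round(r_loose, 3))
```

The slope of the line does not matter here: only how tightly the points follow a line and whether it rises or falls.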

Properties (cont): High correlation does not imply cause and effect

CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin.

Everyone who ate carrots in 1865 is now dead.

45 of 50 17-yr-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest.

Properties: Cause and Effect

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount of damage?

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Scatterplot: Points (0 to 80) vs Fouls (0 to 30). Plotted points: (1,2) (24,75) (1,0) (18,59) (9,9) (3,7) (5,35) (20,46) (1,0) (3,2) (22,57)]

The correlation is due to a third "lurking" variable: playing time.

correlation r = .935
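The fouls and points example can be reproduced with a few lines of code. In the sketch below, the (fouls, points) pairs are read from the plot labels, and the position of the comma inside each pair is an assumption based on the axis ranges (fouls up to 30, points up to 80). The result should land close to the r = .935 reported on the slide. Nothing in the calculation involves playing time, which is why r alone cannot reveal the lurking variable.

```python
import math

# (fouls, points) pairs; comma placement is assumed from the axis ranges
pairs = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7),
         (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)]

fouls = [x for x, _ in pairs]
points = [y for _, y in pairs]
n = len(pairs)

mean_f = sum(fouls) / n
mean_p = sum(points) / n
# Sample standard deviations (divisor n - 1)
s_f = math.sqrt(sum((x - mean_f) ** 2 for x in fouls) / (n - 1))
s_p = math.sqrt(sum((y - mean_p) ** 2 for y in points) / (n - 1))

# Sum of products of standardized values, divided by n - 1
r = sum(((x - mean_f) / s_f) * ((y - mean_p) / s_p) for x, y in pairs) / (n - 1)
print(round(r, 3))
```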

End of Chapter 3





Page 141: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

Properties: Cause and Effect

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.

x = fouls committed by player

y = points scored by same player

(x, y) = (fouls, points)

[Scatterplot: Points (vertical axis, 0 to 80) vs. Fouls (horizontal axis, 0 to 30)]

(1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)

The correlation is due to a third "lurking" variable: playing time.

correlation r = 0.935
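The reported correlation can be checked directly from the labeled points. The sketch below treats the eleven labels on the scatterplot as (fouls, points) pairs, for example (24, 75); that reading is an assumption based on the axis ranges (fouls up to 30, points up to 80). The helper name `pearson_r` is hypothetical.

```python
from math import sqrt

# (fouls, points) pairs read from the scatterplot labels; splitting each
# label into two numbers is an assumption based on the axis ranges.
data = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7),
        (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)]

def pearson_r(pairs):
    """Sample correlation coefficient r for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

r = pearson_r(data)
print(round(r, 3))  # prints 0.935
```

The value matches the slide, but the calculation says nothing about why fouls and points move together. Players with more playing time have more chances both to foul and to score.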

Page 142: Chapter 3 Descriptive Statistics: Graphical and Numerical Summaries of Data

End of Chapter 3
