Describing data with graphics and numbers


Types of Data

• Categorical variables
  – also known as class variables or nominal variables

• Quantitative variables
  – also known as numerical variables
  – either continuous or discrete

Graphing categorical variables

Ten most common causes of death in Americans between 15 and 19 years old in 1999.

Bar graphs

Graphing numerical variables

Heights of BIOL 300 students (cm)

165 168 163 173 170 163 170 155 152 190 170 168 142 160 154 165 156 177 173 165 165 175 155 166 168 165 180 165

Stem-and-leaf plot

19 | 0
18 | 0
17 | 0 0 0 3 3 5 7
16 | 0 3 3 5 5 5 5 5 5 6 8 8 8
15 | 2 4 5 5 6
14 | 2
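As a rough illustration, the stem-and-leaf plot above can be rebuilt from the listed heights by splitting each value into a tens stem and a units leaf. This is a minimal Python sketch; the function name `stem_and_leaf` is a hypothetical helper chosen for this example, not something from the slides.

```python
from collections import defaultdict

# Heights (cm) of the 28 BIOL 300 students listed on the slide.
heights = [165, 168, 163, 173, 170, 163, 170, 155, 152, 190, 170, 168,
           142, 160, 154, 165, 156, 177, 173, 165, 165, 175,
           155, 166, 168, 165, 180, 165]


def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with stem = value // 10 and leaf = value % 10."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(heights)

# Print the largest stem first, matching the layout on the slide.
for stem in sorted(plot, reverse=True):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Stems with no observations are simply omitted by this sketch; here every stem from 14 to 19 has at least one leaf.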

Frequency table

Height Group   Frequency
141-150        1
151-160        6
161-170        15
171-180        5
181-190        1
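The frequency table can be reproduced by counting how many heights fall in each group. The sketch below assumes both ends of each height group are inclusive, which is how the group labels read; `frequency_table` is a hypothetical helper name.

```python
# Heights (cm) of the 28 BIOL 300 students listed on the slide.
heights = [165, 168, 163, 173, 170, 163, 170, 155, 152, 190, 170, 168,
           142, 160, 154, 165, 156, 177, 173, 165, 165, 175,
           155, 166, 168, 165, 180, 165]

# Height groups from the slide; both bounds are treated as inclusive (assumption).
groups = [(141, 150), (151, 160), (161, 170), (171, 180), (181, 190)]


def frequency_table(values, groups):
    """Count the values that fall inside each (low, high) group."""
    return {f"{low}-{high}": sum(low <= v <= high for v in values)
            for low, high in groups}


table = frequency_table(heights, groups)
for group, count in table.items():
    print(f"{group:<10}{count}")
```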

Histogram

Frequency distribution

Histogram with more data

Cumulative Frequency Distribution

[Figure: cumulative frequency (0 to 1) plotted against height (in cm) of Bio300 students, with the 50th percentile (median) and the 90th percentile marked.]
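To make the link between a cumulative frequency distribution and percentiles concrete, here is a small sketch using the 28 heights listed earlier (the figure itself appears to use a larger data set, so the numbers differ). It takes the p-th percentile to be the smallest observed value whose cumulative frequency reaches p; this is only one of several conventions in use, and the function names are hypothetical.

```python
# Heights (cm) of the 28 BIOL 300 students listed on an earlier slide.
heights = [165, 168, 163, 173, 170, 163, 170, 155, 152, 190, 170, 168,
           142, 160, 154, 165, 156, 177, 173, 165, 165, 175,
           155, 166, 168, 165, 180, 165]


def cumulative_frequency(values, x):
    """Fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def percentile(values, p):
    """Smallest observed value whose cumulative frequency is at least p (0 < p <= 1)."""
    for v in sorted(values):
        if cumulative_frequency(values, v) >= p:
            return v


median_height = percentile(heights, 0.5)
p90_height = percentile(heights, 0.9)
print(median_height, p90_height)
```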

Associations between two categorical variables

Association between reproductive effort and avian malaria

Table 2.3A. Contingency table showing incidence of malaria in female great tits subjected to experimental egg removal.

               Control group   Egg removal group   Row total
Malaria                    7                  15          22
No malaria                28                  15          43
Column total              35                  30          65


Mosaic plot

[Figure: mosaic plot with Treatment (Control, Egg removal) on the horizontal axis and relative frequency (0.0 to 1.0) on the vertical axis.]

Figure 2.3B. Mosaic plot for reproductive effort and avian malaria in great tits (Table 2.3A). Blue fill indicates diseased birds, whereas white fill indicates birds free of malaria. n = 65 birds.
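The heights of the filled regions in a mosaic plot are the relative frequencies within each treatment group. The sketch below computes them from the counts in Table 2.3A; the dictionary layout and the name `relative_frequencies` are assumptions made for this example.

```python
# Counts from Table 2.3A, organised by treatment group.
counts = {
    "Control": {"Malaria": 7, "No malaria": 28},
    "Egg removal": {"Malaria": 15, "No malaria": 15},
}


def relative_frequencies(counts):
    """Divide each cell by its group (column) total."""
    result = {}
    for group, cells in counts.items():
        total = sum(cells.values())
        result[group] = {outcome: n / total for outcome, n in cells.items()}
    return result


rel = relative_frequencies(counts)
for group, cells in rel.items():
    print(group, {outcome: round(freq, 2) for outcome, freq in cells.items()})
```

With these counts, 7 of 35 control birds (0.2) and 15 of 30 egg-removal birds (0.5) had malaria.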

Grouped Bar Graph

[Figure: grouped bar graph with bars for Malaria and No malaria within the Control and Egg removal groups; vertical axis from 0 to 25.]

Associations between categorical and numerical variables

Multiple histograms

[Figure: two histograms of protein length (horizontal axis, 0 to 1000), one panel labeled Non-conserved and one labeled Conserved.]

Associations between two numerical variables

Scatterplots


Evaluating Graphics

• Lie factor

• Chartjunk

• Efficiency

Don’t mislead with graphics

Better representation of truth

Lie Factor

• Lie factor = (size of effect shown in graphic) / (size of effect in data)

Lie Factor Example

Effect in graphic: 2.33 / 0.08 = 29.1

Effect in data: 6748 / 5844 = 1.15

Lie factor = 29.1 / 1.15 = 25.3
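The arithmetic in this example can be checked with a short sketch. The function name `lie_factor` is a hypothetical helper; the four input numbers are the ones quoted on the slide. Note that the slide rounds each effect before dividing, so the unrounded ratio comes out slightly lower (about 25.2 rather than 25.3).

```python
def lie_factor(effect_in_graphic, effect_in_data):
    """Ratio of the effect size shown in the graphic to the effect size in the data."""
    return effect_in_graphic / effect_in_data


effect_graphic = 2.33 / 0.08   # effect as drawn in the graphic
effect_data = 6748 / 5844      # effect in the underlying data

lf_unrounded = lie_factor(effect_graphic, effect_data)

# Same calculation with the rounded intermediate values used on the slide.
lf_slide = lie_factor(round(effect_graphic, 1), round(effect_data, 2))

print(round(lf_unrounded, 1), round(lf_slide, 1))
```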

Chartjunk

[Figure: chart of quarterly values (1st Qtr to 4th Qtr, scale 0 to 100) for the series North, West, and East.]

Needless 3D Graphics


Summary: Graphical methods for frequency distributions

Type of data       Method
Categorical data   Bar graph
Numerical data     Histogram
                   Cumulative frequency distribution

Summary: Associations between variables

                     Explanatory variable
Response variable    Categorical                          Numerical
Categorical          Contingency table
                     Grouped bar graph
                     Mosaic plot
Numerical            Multiple histograms                  Scatter plot
                     Cumulative frequency distributions

Great book on graphics


Describing data

Two common descriptions of data

• Location (or central tendency)

• Width (or spread)

Measures of location

Mean

Median

Mode

Mean

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

n is the size of the sample

Mean

Y1 = 56, Y2 = 72, Y3 = 18, Y4 = 42

$$\bar{Y} = \frac{56 + 72 + 18 + 42}{4} = 47$$

Median

• The median is the middle measurement in a set of ordered data.

The data:

18 28 24 25 36 14 34


can be put in order:

14 18 24 25 28 34 36

Median is 25.
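The mean and median examples above can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are chosen for this example.

```python
import statistics

# Mean example from the slides: Y1..Y4.
y = [56, 72, 18, 42]
sample_mean = statistics.mean(y)

# Median example from the slides: the function sorts the data internally
# and returns the middle measurement (7 values, so the 4th in order).
data = [18, 28, 24, 25, 36, 14, 34]
sample_median = statistics.median(data)

print(sample_mean, sample_median)
```

With an even number of observations, `statistics.median` returns the average of the two middle values.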

[Figure: histogram of mouse weight at 50 days old, in a line selected for small size (horizontal axis 5 to 18, vertical axis Frequency), with the mean, median, and mode marked.]

Mean vs. median in politics

• 2004 U.S. Economy

• Republicans: times are good
  – Mean income increasing ~4% per year

• Democrats: times are bad
  – Median family income fell

• Why?

Mean 169.3 cm

Median 170 cm

Mode 165-170 cm

[Figure: cumulative frequency (0 to 1) plotted against height (in cm) of Bio300 students.]

Measures of width

• Range

• Standard deviation

• Variance

• Coefficient of variation

Range

14 17 18 20 22 22 24 25 26 28 28 28 30 34 36

The range is 36 - 14 = 22

Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$$

Sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

n is the sample size

Shortcut for calculating sample variance

$$s^2 = \frac{n}{n - 1} \left( \frac{\sum_{i=1}^{n} Y_i^2}{n} - \bar{Y}^2 \right)$$

Standard deviation (SD)

• Positive square root of the variance

σ is the true standard deviation

s is the sample standard deviation

In class exercise

Calculate the variance and standard deviation of a sample with the following data:

6, 1, 2

Answer

Variance = 7

Standard deviation = $\sqrt{7}$ ≈ 2.65
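The answer can be verified with both the defining formula for the sample variance and the shortcut formula. The sketch below is a minimal check; the function names are hypothetical and the only data used are the three values from the exercise.

```python
import math


def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut: n / (n - 1) times (mean of the squares minus the squared mean)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(y * y for y in values) / n
    return (n / (n - 1)) * (mean_of_squares - mean ** 2)


data = [6, 1, 2]
variance = sample_variance(data)            # (9 + 4 + 1) / 2
variance_shortcut = sample_variance_shortcut(data)
standard_deviation = math.sqrt(variance)

print(variance, round(variance_shortcut, 10), round(standard_deviation, 2))
```

The two formulas agree up to floating-point rounding; the shortcut can lose precision when the mean is large relative to the spread, which is one reason to prefer the definition in code.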

Coefficient of variation (CV)

$$\mathrm{CV} = 100 \, \frac{s}{\bar{Y}}$$
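As an illustration of the CV formula, the sketch below reuses the in-class exercise data (6, 1, 2); applying the CV to that sample is an assumption made for this example, not something the slides do. The helper name is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV = 100 * s / mean, i.e. the sample SD as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


data = [6, 1, 2]          # sample from the in-class exercise
cv = coefficient_of_variation(data)
print(round(cv, 1))       # SD is sqrt(7), mean is 3
```

Because the CV divides by the mean, it is only meaningful for measurements on a scale with a true zero and a nonzero mean.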

Equal means, different variances

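[Figure: frequency distributions with equal means and different variances, labeled V = 1, V = 2, and V = 10; horizontal axis Value (-5 to 10), vertical axis Frequency.]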

Manipulating means

• The mean of the sum of two variables:

E[X + Y] = E[X]+ E[Y]

• The mean of the sum of a variable and a constant:

E[X + c] = E[X]+ c

• The mean of a product of a variable and a constant:

E[c X] = c E[X]

• The mean of a product of two variables:

E[X Y] = E[X] E[Y]

if X and Y are independent (more generally, if and only if X and Y are uncorrelated).

Manipulating variance

• The variance of the sum of two variables:

Var[X + Y] = Var[X]+ Var[Y]

if X and Y are independent (more generally, if and only if X and Y are uncorrelated).

• The variance of the sum of a variable and a constant:

Var[X + c] = Var[X]

• The variance of a product of a variable and a constant:

Var[c X] = c^2 Var[X]
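The rules for means and variances can be checked exactly on a small made-up example. In the sketch below, X and Y are hypothetical variables whose joint distribution puts equal probability on every pair of values, which makes them independent by construction; exact fractions are used so the equalities hold without rounding. The values, the constant c, and the helper names are all assumptions for illustration.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    """Population mean of a list of integers, as an exact fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Every (x, y) pair is equally likely, so X and Y are independent.
pairs = list(product([1, 2, 3], [10, 20]))
X = [x for x, _ in pairs]
Y = [y for _, y in pairs]
c = 5

checks = {
    "E[X + Y] = E[X] + E[Y]": mean(x + y for x, y in pairs) == mean(X) + mean(Y),
    "E[X + c] = E[X] + c": mean(x + c for x in X) == mean(X) + c,
    "E[c X] = c E[X]": mean(c * x for x in X) == c * mean(X),
    "E[X Y] = E[X] E[Y]": mean(x * y for x, y in pairs) == mean(X) * mean(Y),
    "Var[X + Y] = Var[X] + Var[Y]":
        variance(x + y for x, y in pairs) == variance(X) + variance(Y),
    "Var[X + c] = Var[X]": variance(x + c for x in X) == variance(X),
    "Var[c X] = c^2 Var[X]": variance(c * x for x in X) == c ** 2 * variance(X),
}

for rule, holds in checks.items():
    print(f"{rule}: {holds}")
```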

Parents’ heights

                                Mean    Variance
Father height                   174.3   71.7
Mother height                   160.4   58.3
Father height + Mother height   334.7   184.9
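In this table the means add (174.3 + 160.4 = 334.7), but the variance of the sum (184.9) is larger than the sum of the variances (130.0). Using the general identity Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y], the difference gives the covariance implied by the table. The sketch below does that arithmetic; the identity is standard, while the variable names are chosen for this example.

```python
# Values from the parents' heights table.
mean_father, var_father = 174.3, 71.7
mean_mother, var_mother = 160.4, 58.3
mean_sum, var_sum = 334.7, 184.9

# Means always add, whether or not the variables are independent.
sum_of_means = mean_father + mean_mother

# Variances add only when the covariance is zero.
sum_of_variances = var_father + var_mother

# Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y], solved for the covariance.
implied_covariance = (var_sum - sum_of_variances) / 2

print(round(sum_of_means, 1), round(sum_of_variances, 1), round(implied_covariance, 2))
```

A positive implied covariance means that fathers' and mothers' heights in this sample are not independent.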
