Ch. 1 Looking at Data – Distributions Displaying Distributions with Graphs Section 1.1 IPS © 2006 W.H. Freeman and Company




Data

Statistics is the science of learning from data.

Data are numerical facts.

Data are numbers with a context.

To make sense of the numbers, we must understand the context.


Data Set

A Data Set is a list of individual objects and variables.

A case is the data for one individual.

A variable is a characteristic of an individual. Its value varies among individuals.

The distribution of a variable tells us what values the variable takes and how often it takes these values.


Two types of variables

A variable is either quantitative or categorical, depending on the type of value it can take.

A quantitative variable takes numerical values.

A categorical variable places individuals into one of several categories (or groups).


Individuals in sample   DIAGNOSIS       AGE AT DEATH
Patient A               Heart disease   56
Patient B               Stroke          70
Patient C               Stroke          75
Patient D               Lung cancer     60
Patient E               Heart disease   80
Patient F               Accident        73
Patient G               Diabetes        69

Categorical (DIAGNOSIS): Each individual is assigned to one of several categories.

Quantitative (AGE AT DEATH): Each individual is attributed a numerical value.


Graphs of Variables

Graphs highlight important features of a data set and often reveal relationships that are not apparent from a listing of the data.

The type of graph we use depends on the type of variable.


Graphs for Categorical Variables

The distribution of a categorical variable gives the count or percent of individuals in each category. It is represented visually by a bar graph or a pie chart.

Bar graph: The count of each category is represented by the height of a bar.

Pie chart: The percent in each category is represented by a slice of the pie.
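As a rough sketch of what "count or percent of individuals in each category" means in practice, the snippet below tallies the DIAGNOSIS column from the seven-patient sample shown earlier in the deck and draws a text bar graph sorted by count. The function names are illustrative only, and the text bars stand in for a real chart.

```python
from collections import Counter

# Diagnoses of the seven patients in the sample table shown earlier in the deck.
diagnoses = [
    "Heart disease", "Stroke", "Stroke", "Lung cancer",
    "Heart disease", "Accident", "Diabetes",
]

def categorical_distribution(values):
    """Return (category, count, percent) rows, largest count first."""
    counts = Counter(values)
    total = len(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(cat, n, 100.0 * n / total) for cat, n in ordered]

def text_bar_graph(rows):
    """Draw one bar per category; the bar length equals the count."""
    width = max(len(cat) for cat, _, _ in rows)
    return "\n".join(
        f"{cat.ljust(width)} | {'#' * n} {n} ({pct:.1f}%)" for cat, n, pct in rows
    )

rows = categorical_distribution(diagnoses)
print(text_bar_graph(rows))
```

The counts are what a bar graph would show as bar heights; the percents are what a pie chart would show as slices, and they add up to 100% because every patient falls in exactly one category.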


Top 10 causes of death in the United States, 2001 (two bar graphs, counts x1000)

Bar graph sorted by rank: easy to analyze.

Bar graph sorted alphabetically: much less useful.


Pie charts

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

Percent of people dying from top 10 causes of death in the United States in 2000


Child poverty before and after government intervention (UNICEF, 1996)

The poverty line is defined as 50% of national median income.

What does this chart tell you?

• The United States has the highest rate of child poverty among developed nations (22% of those under 18).

• Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).

Could you transform this bar graph to fit in one pie chart? In two pie charts? Why?


Figure 1.1 p. 8

Which graph is a better representation of the data on p. 7?


Graphs for Quantitative Variables

Stemplots

Histograms

Line graphs and time plots


Stemplots

How to make a stemplot:

1) Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which displays the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

2) Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.

3) Write each leaf in the row to the right of its stem, in increasing order out from the stem.

STEM | LEAVES
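The three steps above can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole-number observations; the function name is made up for the example, and the data are the ages at death from the patient sample earlier in the deck. Listing stems that have no leaves is a choice made here so that gaps in the data stay visible.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into a stem and a one-digit leaf.
    # Sorting first means the leaves arrive in increasing order (step 3).
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Step 2: stems in a column, smallest at the top, with a vertical bar.
    # Stems without leaves are still listed (a choice made in this sketch).
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

ages_at_death = [56, 70, 75, 60, 80, 73, 69]
print("\n".join(stemplot(ages_at_death)))
```

For the seven ages this gives four rows, with stems 5 through 8 and the leaves 6; 0 9; 0 3 5; and 0.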


Example 1.5, p. 11:

a. Do a stemplot for the female percent.

b. Then do a histogram of the same data set.

c. Split the stems.


Example 1.5, p. 11: Do a back-to-back stemplot of the female and male percent.
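A back-to-back stemplot shares one column of stems between two groups, with one group's leaves running out to the left and the other's to the right. The sketch below is one possible layout, not the textbook's. The two lists are hypothetical whole-number values invented for illustration; they are not the Example 1.5 data.

```python
def back_to_back_stemplot(left, right):
    """Return lines with shared stems; left leaves increase away from the stem."""
    def group(values):
        grouped = {}
        for v in sorted(values):
            stem, leaf = divmod(v, 10)
            grouped.setdefault(stem, []).append(str(leaf))
        return grouped

    left_groups, right_groups = group(left), group(right)
    stems = range(min(min(left_groups), min(right_groups)),
                  max(max(left_groups), max(right_groups)) + 1)
    # Reverse the left leaves so the smallest leaf sits next to the stem.
    left_rows = {s: " ".join(reversed(left_groups.get(s, []))) for s in stems}
    pad = max(len(row) for row in left_rows.values())
    return [
        f"{left_rows[s].rjust(pad)} | {s:>2} | {' '.join(right_groups.get(s, []))}".rstrip()
        for s in stems
    ]

# Hypothetical values, made up for this sketch.
female = [28, 31, 35, 42, 47]
male = [33, 39, 41, 44, 52]
print("\n".join(back_to_back_stemplot(female, male)))
```

Reading across a row compares the two groups at the same stem, which is the point of placing them back to back.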


Histograms

The range of values a variable can take is divided into equal-size intervals called classes or bins.

The height of each bar shows the number (or %) of individual data points that fall in each interval.

The first bar represents all states where the percent of Hispanics in their population is between 0% and 4.99%. The height of the bar shows how many states (27) have a percent Hispanic in this range.

The last bar represents all states with a percent Hispanic between 40% and 44.99%. There is only one such state: New Mexico, at 42.1% Hispanics.
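The class counts behind a histogram can be sketched as follows. This is a minimal illustration with classes 5 percentage points wide starting at 0, as on the slide. Apart from the 42.1 for New Mexico, the values in the list are hypothetical placeholders, not the actual state data, and the function name is made up for the example.

```python
import math

def class_counts(values, width, start=0.0):
    """Count the values in each equal-width class [lower, lower + width).

    Assumes every value is at least `start`.
    """
    n_classes = max(math.floor((v - start) / width) for v in values) + 1
    counts = [0] * n_classes
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return [
        (start + i * width, start + (i + 1) * width, c)
        for i, c in enumerate(counts)
    ]

# Hypothetical percents, except 42.1 (New Mexico, from the slide).
percents = [0.7, 1.5, 2.4, 3.2, 4.7, 5.3, 6.8, 7.2, 9.0,
            13.3, 17.1, 25.3, 32.4, 42.1]

for lower, upper, count in class_counts(percents, width=5):
    # Each printed row is one bar; its length is the class count.
    print(f"{lower:4.0f}% to <{upper:3.0f}% | {'#' * count} {count}")
```

Classes with no observations are kept, so a gap such as the empty 35% to 40% class before New Mexico remains visible.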


Stemplots versus histograms

Stemplots are quick-and-dirty histograms that can easily be done by hand, which makes them very convenient for back-of-the-envelope calculations.

However, they are rarely found in scientific or lay publications.


Distribution of a Variable

The distribution of a variable tells us what values the variable takes and how often it takes these values.

When examining a distribution, look for the following:

SHAPE of the distribution. Some shapes are symmetric or skewed. Some shapes have a number of modes (major peaks).

CENTER of the distribution. The center is the middle of the data.

SPREAD of the distribution. The spread is the range of values.

OUTLIERS and deviations from the overall shape. Outliers are observations that lie outside the overall pattern of a distribution.
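Center and spread can be put into numbers with a short sketch like the one below. It assumes the median as "the middle of the data" and the smallest-to-largest range as the spread; the slide does not name a specific measure, so both are assumptions of this example. Shape and outliers are still best judged from the graph. The data are the ages at death from the patient sample earlier in the deck.

```python
from statistics import median

def center_and_spread(values):
    """Summarize a quantitative variable by its median and its range."""
    ordered = sorted(values)
    return {
        "center": median(ordered),          # middle value (assumed measure)
        "low": ordered[0],                  # smallest observation
        "high": ordered[-1],                # largest observation
        "range": ordered[-1] - ordered[0],  # spread as high minus low
    }

ages_at_death = [56, 70, 75, 60, 80, 73, 69]
summary = center_and_spread(ages_at_death)
print(summary)
```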


Most common distribution shapes

Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.


Outliers

Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.


How to create a histogram

The shape of a histogram depends on the bin size.

What bin size should you use?

Not too many bins with either 0 or 1 counts

Not so summarized (large bins) that you lose all the information

Not so detailed (small bins) that it is no longer a summary

Rule of thumb: start with 5 to 10 bins

Look at the distribution and refine your bins

(There isn’t a unique or “perfect” solution)
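One way to try out these guidelines is to count the same data with several numbers of bins and see how many bins end up with 0 or 1 counts. The sketch below does that; the data are hypothetical values made up for illustration, and the function names are assumptions of this example.

```python
def bin_counts(values, n_bins):
    """Split the data's range into n_bins equal-width bins and count each one.

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last bin rather than past it.
        i = min(int((v - low) / width), n_bins - 1)
        counts[i] += 1
    return counts

def sparse_bins(counts):
    """Number of bins holding 0 or 1 observations."""
    return sum(1 for c in counts if c <= 1)

# Hypothetical data, made up for this sketch.
data = [12, 15, 17, 18, 21, 22, 22, 24, 25, 26,
        27, 27, 29, 30, 31, 33, 34, 36, 39, 42]

for n in (2, 5, 10, 30):
    counts = bin_counts(data, n)
    print(f"{n:>2} bins: {counts}  (bins with 0 or 1 counts: {sparse_bins(counts)})")
```

With 2 bins the data are over-summarized, with 30 bins almost every bin holds 0 or 1 observations, and something near the 5 to 10 rule of thumb shows the single-peaked shape.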


Same data set, two histograms: one not summarized enough, one too summarized.


Line graphs: time plots

In a time plot, time always goes on the horizontal (x) axis.

We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series:

A trend is a rise or fall that persists over time, despite small irregularities.

A pattern that repeats itself at regular intervals of time is called seasonal variation.


Scales matter

How you stretch the axes and choose your scales can give a different impression.

Figure: four time plots of the same data, "Death rates from cancer (US, 1945-95)". Horizontal axis: Years (1940 to 2000). Vertical axis: Death rate (per thousand). Three plots use a vertical scale of 0 to 250; the fourth uses a vertical scale of 120 to 220.

A picture is worth a thousand words, BUT there is nothing like hard numbers. Look at the scales.
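A quick calculation shows why the choice of scale changes the impression. The sketch below compares how much of the vertical axis the same rise occupies under the two axis ranges used in the plots (0 to 250 and 120 to 220). The rise from 140 to 200 is a hypothetical figure chosen for illustration, not a value read from the slides.

```python
def fraction_of_axis(change, axis_min, axis_max):
    """Share of the vertical axis height taken up by a given change."""
    return change / (axis_max - axis_min)

# Hypothetical rise in the plotted rate, made up for this sketch.
rise = 200 - 140

full_scale = fraction_of_axis(rise, 0, 250)    # axis starting at zero
zoomed_scale = fraction_of_axis(rise, 120, 220)  # axis cropped to the data

print(f"0 to 250 axis:   rise fills {full_scale:.0%} of the height")
print(f"120 to 220 axis: rise fills {zoomed_scale:.0%} of the height")
```

The numbers are identical in both plots; only the cropped axis makes the same rise look more than twice as steep.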


Using Excel

1. Do a bar graph for Problem 1.14 (page 28). Format the data set for its appearance, then copy and paste the data set and graph into Text Boxes in MS Word. Arrange the layout so that it has a professional look.

2. Manually do a stemplot for Problem 1.17 (page 29). Construct a histogram by doing a frequency count on the leaves and then a bar graph of the frequency count.

3. Do a back-to-back stemplot for Problem 1.18 (page 29).

4. Do a bar graph and time plot for Problem 1.12 (page 27).