Ch. 1 Looking at Data – Distributions Displaying Distributions with Graphs Section 1.1 IPS © 2006 W.H. Freeman and Company


Data

Statistics is the science of learning from data.

Data are numerical facts.

Data are numbers with a context.

To make sense of the numbers, we must understand the context.

Data Set

A data set is a list of individuals (the objects described) and the variables recorded for them.

A case is the data for one individual.

A variable is a characteristic of an individual. Its value varies among individuals.

The distribution of a variable tells us what values the variable takes and how often it takes these values.

Two types of variables

A variable is either quantitative or categorical, depending on the type of value it can take.

A quantitative variable takes numerical values.

A categorical variable places individuals into one of several categories (or groups).

Individuals in sample    Diagnosis        Age at death

Patient A                Heart disease    56
Patient B                Stroke           70
Patient C                Stroke           75
Patient D                Lung cancer      60
Patient E                Heart disease    80
Patient F                Accident         73
Patient G                Diabetes         69

Categorical (diagnosis): Each individual is assigned to one of several categories.

Quantitative (age at death): Each individual is attributed a numerical value.
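As a minimal sketch of these definitions, the patient table can be held as a list of cases, one per individual. The variable names below are illustrative choices, not part of the slides. The categorical variable is summarized by counting individuals per category, while the quantitative variable supports arithmetic such as a mean.

```python
from collections import Counter

# Each case is the data for one individual: (individual, diagnosis, age at death).
cases = [
    ("Patient A", "Heart disease", 56),
    ("Patient B", "Stroke", 70),
    ("Patient C", "Stroke", 75),
    ("Patient D", "Lung cancer", 60),
    ("Patient E", "Heart disease", 80),
    ("Patient F", "Accident", 73),
    ("Patient G", "Diabetes", 69),
]

# Categorical variable: its distribution is the count of individuals in each category.
diagnosis_counts = Counter(diagnosis for _, diagnosis, _ in cases)

# Quantitative variable: the values are numbers, so arithmetic on them is meaningful.
ages = [age for _, _, age in cases]
mean_age = sum(ages) / len(ages)

print(dict(diagnosis_counts))
print(f"mean age at death: {mean_age:.1f}")
```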

Graphs of Variables

Graphs highlight important features of a data set and often reveal relationships that are not apparent from a listing of the data.

The type of graph we use depends on the type of variable.

Graphs for Categorical Variables

The distribution of a categorical variable gives the count or percent of individuals in each category. It is represented visually by a bar graph or a pie chart.

Bar graph: The count of each category is represented by the height of a bar.

Pie chart: The percent in each category is represented by a slice of the pie.

[Figure: Top 10 causes of death in the United States, 2001. Two bar graphs of the same counts (×1000), on a vertical axis from 0 to 800.]

Bar graph sorted by rank: easy to analyze.

Bar graph sorted alphabetically: much less useful.

Pie charts

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Figure: Percent of people dying from the top 10 causes of death in the United States in 2000.]
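The arithmetic behind a pie chart can be sketched as follows. The helper name `percent_of_whole` is hypothetical, and the counts are the diagnosis counts from the seven-patient table earlier in these slides. The slices must add up to 100% of one whole, and listing them by rank mirrors the ranked bar graph.

```python
def percent_of_whole(counts):
    """Convert category counts to percents of the total (the slices of a pie)."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

# Diagnosis counts from the patient table (7 individuals).
counts = {"Heart disease": 2, "Stroke": 2, "Lung cancer": 1, "Accident": 1, "Diabetes": 1}
percents = percent_of_whole(counts)

# Sort by rank (largest first); ties are broken alphabetically.
ranked = sorted(percents.items(), key=lambda item: (-item[1], item[0]))
for category, pct in ranked:
    # One '#' per 5 percentage points gives a rough text bar.
    print(f"{category:<14}{pct:5.1f}%  " + "#" * round(pct / 5))
```

Because the percents are computed from a single total, they only make sense in one pie when the categories are parts of the same whole.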

Child poverty before and after government intervention (UNICEF, 1996)

What does this chart tell you?

• The United States has the highest rate of child poverty among developed nations (22% of those under 18).

• Its government does the least, through taxes and subsidies, to remedy the problem (size of the orange bars and percent difference between the orange and blue bars).

Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?

The poverty line is defined as 50% of national median income.

Figure 1.1 p. 8

Which graph is a better representation of the data on p. 7?

Graphs for Quantitative Variables

• Stemplots

• Histograms

• Line graphs and time plots

Stemplots

How to make a stemplot:

1) Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which displays the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

2) Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.

3) Write each leaf in the row to the right of its stem, in increasing order out from the stem.

STEM LEAVES
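The three steps above can be sketched in a few lines. This is a minimal illustration, assuming non-negative whole-number observations; the function name `stemplot` is a hypothetical helper, and the sample values are the ages at death from the patient table earlier in these slides.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into a stem (all but the final digit)
    # and a leaf (the final digit). Sorting first keeps leaves in increasing order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, including empty ones.
    # Step 3: write the leaves to the right of the vertical line.
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>3} | " + "".join(str(leaf) for leaf in leaves[stem])
        lines.append(row.rstrip())
    return lines

ages = [56, 70, 75, 60, 80, 73, 69]
plot_lines = stemplot(ages)
print("\n".join(plot_lines))
```

Empty stems are kept in the column so that gaps in the data remain visible.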

Example 1.5 p. 11. a. Do a stemplot for the female percent. b. Then do a histogram of the same data set. c. Split the stems.

Example 1.5 p. 11. Do a back-to-back stemplot of the female and male percents.

Histograms

The range of values a variable can take is divided into equal-size intervals called classes or bins.

The height of each bar shows the number (or %) of individual data points that fall in each interval.

The first bar represents all states where the percent of Hispanics in their population is between 0% and 4.99%. The height of the bar shows how many states (27) have a percent Hispanic in this range.

The last bar represents all states with a percent Hispanic between 40% and 44.99%. There is only one such state: New Mexico, at 42.1% Hispanic.
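The class counts behind such a histogram can be sketched as below. The helper `class_counts` is a hypothetical name, and the list of percents is made up for illustration; only the 42.1 value (New Mexico) comes from the text. Each value is assigned to the class of width 5 that contains it, and the count per class is the bar height.

```python
def class_counts(values, width):
    """Count the values in each class [lower, lower + width), keeping empty classes."""
    lowers = [int(v // width) * width for v in values]  # lower bound of each value's class
    return {lo: lowers.count(lo) for lo in range(min(lowers), max(lowers) + width, width)}

# Hypothetical percents for illustration; 42.1 is New Mexico's value from the text.
percents = [0.7, 1.5, 2.4, 3.2, 4.7, 5.3, 6.8, 7.2, 13.3, 16.8, 25.3, 32.0, 32.4, 42.1]
counts = class_counts(percents, 5)

for lower, n in counts.items():
    print(f"{lower:>2}% to <{lower + 5}%: " + "#" * n)
```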

Stemplots versus histograms

Stemplots are quick-and-dirty histograms that can easily be done by hand, which makes them very convenient for back-of-the-envelope calculations. However, they are rarely found in scientific or lay publications.

Distribution of a Variable

The distribution of a variable tells us what values the variable takes and how often it takes these values.

When examining a distribution, look for the following:

• SHAPE of the distribution. Some shapes are symmetric or skewed. Some shapes have a number of modes (major peaks).

• CENTER of the distribution. The center is the middle of the data.

• SPREAD of the distribution. The spread is the range of values.

• OUTLIERS and deviations from the overall shape. Outliers are observations that lie outside the overall pattern of a distribution.
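As a rough numerical companion to this checklist, the sketch below reports a center and a spread for the ages at death from the patient table earlier in these slides. Using the median as "the middle of the data" is an assumption made here for illustration; the slides do not name a specific measure at this point.

```python
from statistics import median

ages = [56, 70, 75, 60, 80, 73, 69]  # ages at death from the patient table

center = median(ages)             # middle value of the sorted data
low, high = min(ages), max(ages)  # spread described as the range of values

print(f"center (median): {center}")
print(f"spread: {low} to {high} (range {high - low})")
```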

Most common distribution shapes

Symmetric distribution

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Complex, multimodal distribution

Not all distributions have a simple overall shape, especially when there are few observations.

Skewed distribution

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Outliers

Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.
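One way to act on the "large gap" hint is to sort the observations and look at the distance between neighbours. The sketch below is only an aid to inspection, not a formal outlier rule; the helper name `largest_gap` and the values are hypothetical.

```python
def largest_gap(values):
    """Return (gap, low, high) for the widest gap between neighbouring sorted values."""
    ordered = sorted(values)
    return max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))

# Hypothetical values for illustration only.
values = [5.5, 11.0, 11.4, 12.1, 12.5, 13.0, 13.3, 13.8, 14.2, 18.0]
gap, low, high = largest_gap(values)

print(f"largest gap: {gap:.1f}, between {low} and {high}")
```

A wide gap flags candidates to examine; whether an observation really is an outlier still depends on the overall pattern and on finding an explanation.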

How to create a histogram

The shape of a histogram is determined by the bin size.

What bin size should you use?

• Not too many bins with counts of 0 or 1

• Not so summarized (large bins) that you lose all the information

• Not so detailed (small bins) that it is no longer a summary

Rule of thumb: start with 5 to 10 bins.

Look at the distribution and refine your bins.

(There isn’t a unique or “perfect” solution.)

[Figure: two histograms of the same data set, one not summarized enough and one too summarized.]
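The effect of bin size can be seen by counting the same data with different numbers of equal-width bins. This is a minimal sketch; the helper `bin_counts` and the data values are hypothetical, chosen only to illustrate the three situations described above.

```python
def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so place it in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

# Hypothetical data for illustration only.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 11, 13, 16, 21]

for n_bins in (2, 5, 20):
    print(f"{n_bins:>2} bins: {bin_counts(data, n_bins)}")
```

With 2 bins the shape is lost, with 20 bins most bins hold 0 or 1 observations, and 5 bins show the right-skewed shape of these values.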

Line graphs: time plots

In a time plot, time always goes on the horizontal (x) axis.

We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series:

• A trend is a rise or fall that persists over time, despite small irregularities.

• A pattern that repeats itself at regular intervals of time is called seasonal variation.

[Figure: Death rates from cancer (US, 1945-95). Three time plots of the same data drawn with different axis proportions. Horizontal axis: years, 1940 to 2000. Vertical axis: death rate (per thousand), 0 to 250.]

A picture is worth a thousand words, BUT there is nothing like hard numbers. Look at the scales.

Scales matter

How you stretch the axes and choose your scales can give a different impression.


[Figure: time plot of death rate (per thousand) against years, 1940 to 2000, with the vertical axis running from 120 to 220.]
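The effect of the vertical scale can be put into numbers. The sketch below computes how high a value is drawn as a fraction of the plot height for two axis ranges, 0 to 250 and 120 to 220. The helper `apparent_height` and the two rates (150 and 200) are hypothetical, chosen for illustration.

```python
def apparent_height(value, axis_min, axis_max):
    """Fraction of the plot's vertical extent at which a value is drawn."""
    return (value - axis_min) / (axis_max - axis_min)

early, late = 150, 200  # hypothetical rates; the true ratio is 200 / 150, about 1.33

for axis_min, axis_max in [(0, 250), (120, 220)]:
    h_early = apparent_height(early, axis_min, axis_max)
    h_late = apparent_height(late, axis_min, axis_max)
    print(f"axis {axis_min} to {axis_max}: heights {h_early:.2f} and {h_late:.2f}, "
          f"visual ratio {h_late / h_early:.2f}")
```

On the axis starting at 0 the drawn heights keep the true ratio of the two values; on the truncated axis the same increase looks about twice as steep.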

Using Excel

1. Do a bar graph for Problem 1.14 (page 28). Format the data set for its appearance, then copy and paste the data set and graph into Text Boxes in MS Word. Arrange the layout so that it has a professional look.

2. Manually do a stemplot for Problem 1.17 (page 29). Construct a histogram by doing a frequency count on the leaves and then a bar graph of the frequency count.

3. Do a back-to-back stemplot for Problem 1.18 (page 29).

4. Do a bar graph and time plot for Problem 1.12 (page 27).
