Ch. 1 Looking at Data – Distributions
Displaying Distributions with Graphs
Section 1.1 IPS
© 2006 W.H. Freeman and Company
Data
Statistics is the science of learning from data.
Data are numerical facts.
Data are numbers with a context.
To make sense of the numbers,
we must understand the context.
Data Set
A Data Set is a list of individual objects and variables.
A case is the data for one individual.
A variable is a characteristic of an individual. Its value varies among individuals.
The distribution of a variable tells us what values the variable takes and how often it takes these values.
Two types of variables
A variable is either quantitative or categorical, depending on the type of value it can take.
A quantitative variable takes numerical values.
A categorical variable places individuals into one of several categories (or groups).
Individuals in sample   DIAGNOSIS       AGE AT DEATH
Patient A               Heart disease   56
Patient B               Stroke          70
Patient C               Stroke          75
Patient D               Lung cancer     60
Patient E               Heart disease   80
Patient F               Accident        73
Patient G               Diabetes        69
Categorical (DIAGNOSIS): Each individual is assigned to one of several categories.
Quantitative (AGE AT DEATH): Each individual is attributed a numerical value.
Graphs of Variables
Graphs highlight important features of a data set and often reveal relationships
that are not apparent from a listing of the data.
The type of graph we use depends on
the type of variable.
Graphs for Categorical Variables
The distribution of a categorical variable gives the count or percent of individuals in each category. It is represented visually by a bar graph or a pie chart.
Bar graph: The count of each category is represented by the height of a bar.
Pie chart: The percent in each category is represented by a slice of the pie.
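As a minimal sketch of how the counts and percents behind a bar graph or pie chart could be tabulated, the snippet below uses the DIAGNOSIS column from the patient table earlier in this section. The function name `category_distribution` and the text "bars" are illustrative choices, not part of the course material.

```python
from collections import Counter

# Diagnoses of the seven patients in the sample table.
diagnoses = [
    "Heart disease", "Stroke", "Stroke", "Lung cancer",
    "Heart disease", "Accident", "Diabetes",
]

def category_distribution(values):
    """Return {category: (count, percent)}, largest count first."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, 100 * count / total)
        for category, count in counts.most_common()
    }

distribution = category_distribution(diagnoses)
for category, (count, percent) in distribution.items():
    # One '#' per individual stands in for the height of a bar.
    print(f"{category:<14}{'#' * count:<4}{count:>2} {percent:5.1f}%")
```

The counts correspond to bar heights and the percents to pie slices; the percents add up to 100 because every individual falls in exactly one category.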
Top 10 causes of death in the United States, 2001
[Two bar graphs of the same counts (x1000), vertical axis from 0 to 800]
Bar graph sorted by rank: easy to analyze.
Bar graph sorted alphabetically: much less useful.
Pie charts
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.
[Pie chart: percent of people dying from the top 10 causes of death in the United States in 2000]
Child poverty before and after government intervention—UNICEF, 1996
What does this chart tell you?
• The United States has the highest rate of child poverty among developed nations (22% of those under 18).
• Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).
Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
Figure 1.1 p. 8
Which graph is a better representation of the data on p. 7?
Graphs for Quantitative Variables
• Stemplots
• Histograms
• Line graphs and time plots
Stemplots
How to make a stemplot:
1) Separate each observation into a stem,
consisting of all but the final (rightmost) digit,
and a leaf, which displays the final digit.
Stems may have as many digits as needed,
but each leaf contains only a single digit.
2) Write the stems in a vertical column with the smallest
value at the top, and draw a vertical line at the right
of this column.
3) Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
STEM LEAVES
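The three steps above can be sketched in a few lines of Python. This is an illustration only: it assumes non-negative whole numbers, uses the ages at death from the patient table earlier in this section, and keeps empty stems so that gaps stay visible (a choice made here, not a rule stated in the steps).

```python
from collections import defaultdict

def stemplot(values):
    """Return the text lines of a stemplot for non-negative integers."""
    leaves_by_stem = defaultdict(list)
    # Step 1: split each observation into a stem and a one-digit leaf.
    # Sorting first means the leaves of each stem end up in increasing order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Step 2: list the stems from smallest to largest, one per row.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        # Step 3: write the leaves to the right of their stem.
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

ages_at_death = [56, 70, 75, 60, 80, 73, 69]
for line in stemplot(ages_at_death):
    print(line)
# 5 | 6
# 6 | 09
# 7 | 035
# 8 | 0
```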
Example 1.5 p. 11.
a. Do a stemplot for the female percent.
b. Then do a histogram of the same data set.
c. Split the stems.
Example 1.5 p. 11. Do a back-to-back stemplot of the female and male percent.
Histograms
The range of values a variable can take is divided into equal-size intervals called classes or bins.
The height of each bar shows the number (or %) of individual data points that fall in each interval.
The first bar represents all states where the percent of Hispanics in their
population is between 0% and 4.99%. The height of the bar shows how
many states (27) have a percent Hispanic in this range.
The last bar represents all states with a percent Hispanic between 40% and
44.99%. There is only one such state: New Mexico, at 42.1% Hispanics.
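A rough sketch of the counting behind a histogram, assuming classes of width 5 starting at 0 as in the description above. The values in `percents` are made up for illustration and are NOT the state data; `class_counts` is a hypothetical helper name.

```python
def class_counts(values, start, width, n_classes):
    """Count the values in each class [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    return counts

# Made-up percents, for illustration only.
percents = [0.7, 1.5, 2.4, 3.2, 4.7, 5.3, 8.0, 9.9, 12.0, 16.8, 25.3, 32.4, 41.5]
counts = class_counts(percents, start=0, width=5, n_classes=9)
for i, count in enumerate(counts):
    low = 5 * i
    # The number of '#' marks plays the role of the bar height.
    print(f"{low:>2}% to under {low + 5:>2}%  {'#' * count}")
```

Each value lands in exactly one class because a class includes its lower limit and excludes its upper limit, which matches intervals such as 0% to 4.99%.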
Stemplots versus histograms
Stemplots are quick and dirty histograms that can easily be done by hand, and are therefore very convenient for back-of-the-envelope calculations. However, they are rarely found in scientific or lay publications.
Distribution of a Variable
The distribution of a variable tells us what values the variable takes and how often it takes these values.
When examining a distribution, look for the following:
SHAPE of the distribution. Some shapes are symmetric or skewed. Some shapes have a number of modes (major peaks).
CENTER of the distribution. The center is the middle of the data.
SPREAD of the distribution. The spread is the range of values.
OUTLIERS and deviations from the overall shape. Outliers are observations that lie outside the overall pattern of a distribution.
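As a small illustration of CENTER and SPREAD, the sketch below summarizes the ages at death from the patient table earlier in this section. Taking the median as "the middle of the data" is an assumption made here; the slide does not name a specific measure.

```python
from statistics import median

ages_at_death = [56, 70, 75, 60, 80, 73, 69]

def describe(values):
    """Return a simple center and spread summary of a list of numbers."""
    ordered = sorted(values)
    return {
        "center": median(ordered),          # assumed measure of the middle
        "smallest": ordered[0],
        "largest": ordered[-1],
        "spread": ordered[-1] - ordered[0],  # width of the range of values
    }

summary = describe(ages_at_death)
print(summary)
# {'center': 70, 'smallest': 56, 'largest': 80, 'spread': 24}
```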
Most common distribution shapes
Symmetric distribution
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
Complex, multimodal distribution
Not all distributions have a simple overall shape, especially when there are few observations.
Outliers
Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
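One way to look for large gaps is to sort the data and compare neighbouring values, as in this sketch. The numbers in `percent_elderly` are hypothetical and are NOT the state data; the helper `largest_gaps` is an illustrative name. A large gap only flags a candidate, and whether it is an outlier remains a judgment to be explained from context.

```python
def largest_gaps(values, top=2):
    """Return the `top` largest gaps between neighbouring sorted values.

    Each gap is reported as (gap size, value below the gap, value above it).
    """
    ordered = sorted(values)
    gaps = [
        (round(high - low, 1), low, high)  # round to hide float noise
        for low, high in zip(ordered, ordered[1:])
    ]
    return sorted(gaps, reverse=True)[:top]

# Hypothetical percents, for illustration only.
percent_elderly = [5.5, 10.8, 11.2, 11.9, 12.3, 12.6, 13.0, 13.4, 13.9, 14.4, 17.8]
for gap, low, high in largest_gaps(percent_elderly):
    print(f"gap of {gap} between {low} and {high}")
```

Here the two largest gaps isolate the smallest and the largest value from the main body of the data, which is the pattern the slide describes.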
How to create a histogram
The shape of a histogram is determined by the bin size.
What bin size should you use?
Not too many bins with either 0 or 1 counts
Not so summarized (large bins) that you lose all the information
Not so detailed (small bins) that it is no longer a summary
Rule of thumb: start with 5 to 10 bins.
Look at the distribution and refine your bins.
(There isn’t a unique or “perfect” solution.)
[Two histograms of the same data set: one not summarized enough, one too summarized]
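To see how bin size changes the picture, the sketch below counts the same data with three bin widths. The data are hypothetical and the helper `histogram_counts` is an illustrative name; the check for bins holding 0 or 1 values follows the first guideline above.

```python
def histogram_counts(values, width):
    """Count values per bin of the given width, starting at a multiple of width."""
    start = (min(values) // width) * width
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for value in values:
        counts[int((value - start) // width)] += 1
    return counts

# Hypothetical data, for illustration only.
data = [12, 15, 21, 22, 24, 27, 28, 30, 31, 33, 35, 36, 38, 41, 44, 47, 52, 58]
for width in (2, 10, 50):
    counts = histogram_counts(data, width)
    sparse = sum(1 for count in counts if count <= 1)
    print(f"width {width:>2}: {len(counts):>2} bins, {sparse} with 0 or 1 values, counts {counts}")
```

With width 2 most bins hold 0 or 1 values (not summarized enough), with width 50 everything collapses into two bins (too summarized), and width 10 gives five bins, in line with the rule of thumb.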
Line graphs: time plots
In a time plot, time always goes on the horizontal, x axis.
We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series:
A trend is a rise or fall that persists over time, despite small irregularities.
A pattern that repeats itself at regular intervals of time is called seasonal variation.
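One common way to separate the two ideas, not taken from these slides, is a moving average over one full seasonal cycle: the repeating pattern averages out and the trend remains. The sketch below uses a hypothetical quarterly series built from a steady rise plus a repeating four-quarter pattern.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly series: a rise of 1 per quarter (the trend)
# plus a pattern that repeats every 4 quarters (the seasonal variation).
seasonal = [3, -1, -3, 1]
series = [10 + t + seasonal[t % 4] for t in range(12)]
smoothed = moving_average(series, window=4)

print(series)    # rises overall, but zigzags within each year
print(smoothed)  # rises by exactly 1.0 per step: the trend alone
```

The window equals the length of the seasonal cycle on purpose; a different window would leave part of the seasonal pattern in the smoothed values.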
Death rates from cancer (US, 1945-95)
[Three time plots of the same title: death rate (per thousand), vertical axis from 0 to 250, against years, 1940 to 2000]
A picture is worth a thousand words,
BUT
There is nothing like hard numbers.
Look at the scales.
Scales matter
How you stretch the axes and choose your scales can give a different impression.
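A quick arithmetic sketch of why the scale changes the impression: the same rise fills a different share of the plot height depending on the axis limits. The axis limits 0 to 250 and 120 to 220 are those of the cancer death rate plots in this section; the rise from 140 to 200 is a hypothetical value chosen for illustration, and `fraction_of_axis` is an illustrative helper name.

```python
def fraction_of_axis(low, high, axis_min, axis_max):
    """Share of the vertical axis covered by a change from low to high."""
    return (high - low) / (axis_max - axis_min)

# Hypothetical rise from 140 to 200 (illustrative values only).
full_axis = fraction_of_axis(140, 200, axis_min=0, axis_max=250)
zoomed_axis = fraction_of_axis(140, 200, axis_min=120, axis_max=220)

print(f"axis 0 to 250:   the rise covers {full_axis:.0%} of the height")
print(f"axis 120 to 220: the rise covers {zoomed_axis:.0%} of the height")
```

The numbers are identical in both cases; only the visual steepness differs, which is why the slide advises looking at the scales.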
Death rates from cancer (US, 1945-95)
[Time plot: death rate (per thousand), vertical axis from 120 to 220, against years, 1940 to 2000]
Using Excel
1. Do a bar graph for Problem 1.14 (page 28). Format the data set for its appearance, then copy and paste the data set and graph into Text Boxes in MS Word. Arrange the layout so that it has a professional look.
2. Manually do a stemplot for Problem 1.17 (page 29). Construct a histogram by doing a frequency count on the leaves and then a bar graph of the frequency count.
3. Do a back-to-back stemplot for Problem 1.18 (page 29).
4. Do a bar graph and time plot for Problem 1.12 (page 27).