DePutron AP Statistics Unit I Packet: Exploring and Describing Data 1
PART I - Exploring and Describing Data Agenda
Day Number | Chapter | Topics to discuss | Homework Due Next Time
1 | Class Intro | The purpose of statistics; receive packet & syllabus; Day One Survey | Complete all tasks on TO DO list.
2 | Prelim | Intro to Stats experiment | Finish questions from the textbook
3 | Prelim | Simulating an experiment | Case closed, page 26
4 | Prelim | Identify population, variables, individuals | Identify population, variables, and individuals for given questions
Day One Survey: Please fill out the survey questions below. When you have finished, come visit me to record your email address in the school computer. You will then be sent this survey to input your answers tonight on SurveyMonkey.com.
Gender | Graduating Class Year | Left- or Right-Handed | Left- or Right-Eyed (look below!) | Shoe Size (no half-sizes, please) | # of Siblings | # of Aunts & Uncles | # of Cousins | Minutes to get to school | Thickness of textbook (measure in MILLIMETERS)
Left-Eyed and Right-Eyed-ness When you pick up a pencil or pen and write, what hand do you typically use? People are also left-eye dominant or right-eye dominant. Which one are you? Here’s how to find out: Hold your hands in front of you like the picture. Find an object about 10-15 feet away. Make a small space to look through. Now, close your right eye, keeping your left eye open. Can you still see it? If so, then you are left-eye dominant. If you can’t see the object, open your right eye and close your left eye. Can you see it? If so, then you are right-eye dominant. Write down one question you have from doing this activity in the space below:
TO DO BEFORE THE 2ND CLASS:
____ 1. Get a 3 ring binder with 8 dividers
____ 2. Loose leaf AND graph paper
____ 3. Get a graphing calculator (TI-84 Recommended). Come see me, if necessary.
____ 4. Read the syllabus.
____ 5. Complete the Day One Survey on surveymonkey.com.
____ 6. Send me an email: [email protected], include:
1. First and last name
2. Questions from the syllabus
3. A statistical question that you are interested in answering this year (e.g., Does increased cell phone use lead to brain tumors? Does the number of AP courses a student takes increase his/her likelihood of going to college?)
Statistical Question:
Preliminary Chapter: Overview of the course Bookwork Define statistics. Explain the difference between a population and a sample. Pg. 11, P.2 (a) Pg. 11, P.4 (a) (b) (c) Pg. 11, P.5 (a) Define individuals. Define variable. Pg. 21, P.12 (a) (b) (c) (d) (e) (f)
Tap Water vs. Bottled Water Experiment Bottled water is becoming an increasingly popular alternative to ordinary tap water. But can people really tell the difference if they aren’t told which is which? Do you think you can tell the difference between bottled water and tap water? This activity will give you the chance to discover answers to both of these questions. How would you design a study to determine if someone could tell the difference? What are some things to keep in mind in conducting this study in order to validate the results? How will we account for these? How will we decide in which order to present the cups to the student? What would you expect? Class Results Record the number of correct and incorrect choices in the table below.
Correct Incorrect Proportion
Let’s assume that no one can really distinguish tap water from bottled water. So if a student guesses correctly, it is only due to luck. How many students would you expect to guess correctly? [no evidence]
Correct Incorrect Proportion
Let’s assume that it is very easy to distinguish the bottled water from the tap water. How many students would you expect to guess correctly? [clear evidence]
Correct Incorrect Proportion
How many correct answers would you need to see to be convinced that it is possible to tell the difference between tap water and bottled water? [possible evidence]
Correct Incorrect Proportion
Did our class do better than blind guessing? Was our proportion enough higher than what blind guessing would produce to say that some students can distinguish between tap water and bottled water? Identify the individuals and variables in this experiment:
Simulating the Experiment Now let’s simulate this experiment to see if random guessing could have produced a result as high as ours. How could we do this using a die? Now record the number of correct guesses: _____ We will create a dot plot of these simulations using a Statistical Data Software called Fathom as a class. Copy our class’s dot plot below. Dot Plot
What proportion of these simulations did as well or better than our class did in choosing the bottled water?
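The die-based simulation described above can also be sketched in code. This is a minimal sketch, not the class procedure: the class size (24), the observed number of correct choices (13), and the chance of a correct blind guess (1/3, as if each student picked one cup out of three) are all hypothetical placeholders to be replaced with the real class values.

```python
import random

def simulate_class(num_students, p_correct, rng):
    """Count how many students choose correctly when everyone guesses blindly."""
    return sum(rng.random() < p_correct for _ in range(num_students))

def proportion_at_least(observed_correct, num_students, p_correct,
                        num_simulations=10_000, seed=1):
    """Proportion of simulated classes that do as well as or better than observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_simulations):
        if simulate_class(num_students, p_correct, rng) >= observed_correct:
            hits += 1
    return hits / num_simulations

# Hypothetical inputs (assumptions, not values from the packet)
NUM_STUDENTS = 24      # class size
P_CORRECT = 1 / 3      # chance of a correct blind guess
OBSERVED = 13          # correct choices seen in class

print(proportion_at_least(OBSERVED, NUM_STUDENTS, P_CORRECT))
```

A small proportion suggests that blind guessing rarely produces a result as high as the class result.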
Populations, Individuals, and Variables In any study, we are collecting data (information) about individuals, or cases. Sometimes these
individuals are also called experimental units, if we are conducting an experiment.
For each case, we record one or more variables: attributes, or characteristics about each
individual in a study. We typically have a research question about a population of interest.
Individuals or Cases: Who (or what) did we gather information from?
Variables: What values did we record for each case? Also,
Are the variables quantitative or categorical?
If quantitative, do we know the units of each variable?
How was each variable measured?
Population of interest in the study: What group does the researcher want to make conclusions
about?
Often, the group of individuals that were studied is only a subset of the population. We
call this group the sample.
In statistics, we often try to make predictions about a large population from a sample.
What is the research question?
Is there a suspicion / hypothesis being tested?
Are data being collected in order to learn more about a group?
Example from Yahoo Sports on the Los Angeles Lakers:
Each row contains data on a single
case (case #1, case #2, etc…)
The heading of each column is the
name of a variable (player, position,
bat, height, etc…)
Each variable has different possible
outcomes. (For weight: many
different outcomes. For position: PG,
SG, SF, etc.)
Can you determine who any of the players are based on these statistics??
Example 1:
Read the following article. Then, identify the individuals, the intended population, the variable, the research question, and any conclusions.
New York Times
AUGUST 22, 2011, 5:14 PM
Really? The Claim: Drinking Green Tea Can Help Lower Cholesterol
By ANAHAD O'CONNOR
THE FACTS
Green tea is thought to be an herbal panacea of sorts, believed by many to have a wide range of health
benefits. But whether it can actually produce measurable effects on cholesterol is a question that has
drawn much debate.
Advocates say green tea’s heart-healthy benefits are due in part to a large concentration of polyphenols,
which block the absorption of cholesterol in the gut. But skeptics argue that any beneficial effect would
be small, and the side effects from a few too many cups a day not worth it.
Numerous studies have delved into the matter, with mixed results. But this year a team of researchers
combined and analyzed data from more than a dozen previous trials to reach a more definitive answer.
The report, published in The American Journal of Clinical Nutrition, involved more than 1,100 people
and looked at studies in which the subjects were randomly assigned to drink either green tea or a placebo
daily for up to several months.
The researchers found that the subjects who received the green tea, on average, did see an effect on their
cholesterol, but it was minimal. Over all, their levels of LDL, or “bad” cholesterol, fell by 2.2
milligrams per deciliter, a change of roughly 2 percent. There was no effect on their levels of HDL, or
“good” cholesterol.
For some it may be worth a shot. But for others there could be side effects: A compound in green tea
called EGCG may interfere with medications like anticoagulants and the cancer drug bortezomib.
THE BOTTOM LINE
Studies have found that green tea may reduce levels of LDL cholesterol, but the effect appears minimal.
a) Who are the individuals, or cases observed in this study?
b) Can you determine the intended population?
c) What variable(s) were recorded on each case?
d) What is the research question of the study?
e) What conclusion was reached? Are there any caveats / concerns raised in the study?
Example 2. The data below came from a study from the US Census Bureau. The Bureau collected
information from every householder in the United States. The goals of the Census are to collect
accurate information about the US population, in order to make decisions about congressional
districts, allocating funds, and studying changes in the population.
a) Who are the individuals, or cases, in this study?
b) Identify the intended population in this study. Do you think the sample of individuals is
representative of the intended population? Why (not)?
c) Which of the following are variables collected on each individual? Justify.
i. Highest level of education completed by householder.
ii. Number of households
iii. Percent distribution by income level
iv. Income level
v. Less than 9th grade, some HS, …, bachelor’s degree or higher
vi. Under $10,000
d) What percentage of householders with less than a 9th grade education earned more than
$75,000?
e) What proportion of householders with less than a 9th grade education earned more than $75,000?
f) How many householders with less than a 9th grade education earned more than $75,000?
Homework Assignment: On pg. 30, read questions P.19-22. For each question, identify:
1. The individuals, or cases (The “who.” )
2. The variables recorded for each case (The “What.”)
3. The intended population of interest.
Bonus: Research your own article, print it out, and identify these same three important parts of the study.
Chapter 1: Displaying Distributions with Graphs
Good graphs are extremely powerful tools for displaying large quantities of complex data; they help
turn the reams of information available today into knowledge. But, unfortunately, some graphs
deceive or mislead. This may happen because the designer chooses to give readers the impression
of better performance or results than is actually the situation. In other cases, the person who
prepares the graph may want to be accurate and honest, but may mislead the reader by a poor
choice of a graph form or poor graph construction.
The following things are important to consider when looking at a graph:
1. Title
2. Labels on both axes of a line or bar chart and on all sections of a pie chart
3. Source of the data
4. Key to a pictograph
5. Uniform size of a symbol in a pictograph
6. Scale: Does it start with zero? If not, is there a break shown?
7. Scale: Are the numbers equally spaced?
1. Can you explain why these graphs might be misleading?
a. b. c.
2. What do you notice is different about the graphs below compared to the graphs from numbers
1a. and 1c.?
a. b.
Stem-and-Leaf Plots A researcher chose a sample of 27 young adult women (ages 18-25) at a health clinic and recorded their resting pulse rates, in beats per minute. The pulse rates are displayed below in
a stem-and-leaf plot (also called a stemplot):
The 8 | 8 you see in the top row represents a woman whose pulse rate was 88 beats per minute.
a) Why do you think there are two “6” stems, two “7” stems, and two “8” stems? Why not
use one stem each?
b) Three young women in the sample were not placed on the stem-leaf plot. Their pulse rates
are 79, 63, and 51 beats per minute. Properly put them on the plot above.
c) On the left side of the stems, create a back-to-back stem plot for the pulse rates of this
sample of 20 older women who also had their pulse rates recorded. Here are the data, in
BPM:
51 57 63 52 57 65 71 83 70 86 62 62 65 60 57 73 58 64 67 71
d) Based on this display we can make comparisons between the pulse rates of older women
and younger women. SOCS [Shape, Outliers, Center, Spread]
A. Shape and Outliers: Discuss the overall shape and any outliers.
B. Center: Which group has a higher center? What does this mean in this study?
Which measure of center is most appropriate?
C. Spread /variability: Which group is more spread out? What does that mean in
this study?
e) Take your own pulse, record your data on the class stemplot (it should be on the board
already).
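As a way to check a hand-drawn plot, here is a minimal sketch that builds a split-stem stemplot (leaves 0-4 on the first copy of a stem, 5-9 on the second) for the older women's pulse rates listed in part c). The function name `split_stemplot` is made up for this illustration.

```python
from collections import defaultdict

# Pulse rates (BPM) of the 20 older women from part c)
older_women = [51, 57, 63, 52, 57, 65, 71, 83, 70, 86,
               62, 62, 65, 60, 57, 73, 58, 64, 67, 71]

def split_stemplot(values):
    """Return the rows of a stemplot in which each stem appears twice."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # tens digit is the stem, ones digit the leaf
        half = 0 if leaf <= 4 else 1    # 0-4 on the first row, 5-9 on the second
        rows[(stem, half)].append(leaf)
    lines = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for half in (0, 1):
            leaves = rows.get((stem, half), [])
            lines.append((f"{stem} | " + " ".join(str(l) for l in leaves)).rstrip())
    return lines

for line in split_stemplot(older_women):
    print(line)
```

Empty rows are kept on purpose, so gaps in the distribution stay visible.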
Histograms: Construct a histogram of the distribution of your classmates’ pulse rates on the grid below.
Pulse Rates
1. What characteristics [SOCS] of the distribution are evident from the histogram?
2. Compared to the stem-and-leaf plot, what details does the histogram lack?
3. When would it be beneficial to use a histogram rather than a stem plot?
Notes regarding shape: A distribution is said to be skewed to the right if it extends further to the right than it does to the
left. (The tail extends to the right)
A distribution is said to be skewed to the left if it extends further to the left than it does to the
right. (The tail extends to the left)
A distribution is said to be symmetric if the right and left sides of the histogram are approximately
[use your judgment] mirror images of each other.
Practice Reading a Histogram Practice Example: Last year, a group of AP Statistics students measured how long they could
balance on their toes with their eyes closed. Results are in seconds.
a) About how many students performed this test? How do you know?
b) What does the bar in the middle of the histogram tell you, in context of the data?
c) Suppose Jamey wants to use this histogram to know the proportion of the participants that
balanced for more than a minute. How would you help Jamey?
d) Here’s another histogram of the same data. In what specific ways does this histogram improve
on the previous one?
e) Here’s a third histogram of the same data. Is this an improvement? Explain.
Roll until “doubles” A game of chance is played in which two dice are rolled until “doubles” are rolled. A trial consists of
a sequence of rolls terminating with a roll of “doubles”.
1. On average, how many times do you think you’ll have to roll two dice to get doubles?
2. Record the calculator procedure:
3. Construct a histogram of the number of rolls until “doubles” are rolled. Use the calculator to
simulate 30 plays of the game.
Trial Rolls Trial Rolls
1 16
2 17
3 18
4 19
5 20
6 21
7 22
8 23
9 24
10 25
11 26
12 27
13 28
14 29
15 30
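The packet runs this simulation on the calculator; the same 30 plays can be sketched in Python. This is only an illustration under the assumption of two fair six-sided dice, and the seed value is arbitrary.

```python
import random
from collections import Counter

def rolls_until_doubles(rng):
    """Roll two dice repeatedly; return how many rolls it took to get doubles."""
    rolls = 0
    while True:
        rolls += 1
        if rng.randint(1, 6) == rng.randint(1, 6):
            return rolls

def play(num_trials=30, seed=None):
    """Simulate num_trials plays of the game and return the roll counts."""
    rng = random.Random(seed)
    return [rolls_until_doubles(rng) for _ in range(num_trials)]

results = play(30, seed=2024)   # arbitrary seed so the run is repeatable

# Text histogram: one star per trial that ended on that roll number
counts = Counter(results)
for k in range(1, max(results) + 1):
    print(f"{k:2d} | " + "*" * counts[k])
```

Changing or removing the seed gives a different set of 30 trials, much like a different student's calculator run.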
Construct the histogram on the grid below. Be sure to label and scale the axis and title the graph.
4. Describe the SOCS for this distribution.
5. Locate the mean and median for this distribution. Which is larger? Why?
6. Reconstruct your histogram in your calculator:
STAT>1:Edit…>L1> input your data > 2nd:STAT PLOT>1:PLOT1…ON>select the histogram picture,
confirm XList is set to L1 > ZOOM>9:ZoomStat
Go to your window screen and adjust the settings. Notice how the shape of your histogram changes.
Let’s play a game…
If you can roll the dice 6 times without rolling “doubles”, I will give you $1. However, if “doubles” are rolled on the first through 6th roll, you pay me $1…Who wants to play?
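Whether the game is worth playing can be checked with a short calculation. The sketch below assumes two fair dice and independent rolls, so the chance of doubles on any one roll is 6/36.

```python
from fractions import Fraction

p_doubles = Fraction(6, 36)               # 6 of the 36 equally likely outcomes are doubles
p_student_wins = (1 - p_doubles) ** 6     # no doubles on all six rolls

# The student gains $1 with probability p_student_wins and loses $1 otherwise
expected_winnings = p_student_wins * 1 + (1 - p_student_wins) * (-1)

print(float(p_student_wins))
print(float(expected_winnings))
```

A negative expected value means that, over many games, the student loses money on average.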
Make observations of the 3 graph types. For each type consider how it displays the
center and spread, how it captures individual data, the ease to create it, etc.
DOT PLOTS
STEMPLOTS
HISTOGRAMS
Examining Distributions
Statistical Language for describing the shape [overall pattern] of a distribution
SOCS – always discuss these 4 things for 1-variable data
Shape: Is the distribution unimodal, bimodal, or uniform? If unimodal: symmetric, right-skewed (positive skew), or left-skewed (negative skew)
Modes: major peaks or clusters in a graph
*If bimodal – describe the center and spread of each group separately
Outliers: Identify any deviations / outliers from the big picture. Talk about them in context of the data.
Speculate/ explain why they are outliers.
Center: Identify the overall center, or middle, of the distribution. Put your comments in context of the
data.
Numerical Measures of center: mean, median, midrange, mode
Spread or Variability: Identify how spread out the data are from the center. Put your comments in
context of the data.
Numerical Measures of spread: standard deviation, IQR, range
Example1: A large grocery store is interested in purchasing watermelons from local farms. They
collected a random sample of watermelons from each farm (labeled A-F), and weighed each
watermelon. They constructed dot plots of the weights of the watermelons from each sample. Dot
plots are below.
Discuss with your classmates: Do not do any computations. There can be alternate conclusions
for some of the questions.
i. On average, which farm(s) appear to produce the heaviest watermelons? The lightest watermelons?
ii. Which would you select as a grocer? Why?
iii. Overall, which farm(s) appear to have watermelon weights that are the least variable?
iv. Which farm has watermelons whose weights are most variable?
v. What distinguishes farm E from the other five farms?
vi. What distinguishes farms C and D, from Farms A, B, and F?
vii. Give a specific feature of farms C and D that’s different from farms A, B, and F.
viii. Why might the grocer prefer to buy from Farm F over the others? Any concerns?
Example 2: Match the following variables with the histograms and bar graphs given below.
Hint: Think about how each variable should behave. Where along the scale should values pile up?
(a) The SAT math scores of people in a college statistics class
(b) Harvard Westlake students’ responses to “do you have your cell phone with you?”
(c) Number of siblings of individuals in a group of 100 HW Seniors
(d) Amount paid for last lunch out by students in this class
(e) Gender breakdown of students in a college biology class
(f) Grades on an easy January exam in Algebra 2
[Dot plot: guess_minus_actual, axis from -10 to 8 (DayOneSurvey2009)]
NUMERICAL SUMMARIES OF CENTER AND SPREAD Measures of Center
ARE YOU SMARTER THAN A MIDDLE SCHOOLER? MEAN, MEDIAN, RANGE CHALLENGE
1. Karl has eight people in his family. He wondered how many hours of TV each of them might have watched in a week if the mean of the eight values were 5 hours. Write down eight amounts of TV-watching that have a mean of five hours:
2. When Karl told his father what he had done, his father wondered how the values might change if the mean were five hours and the median were four hours. Write down eight values that have a mean of five hours and a median of four hours:
3. Karl’s mother challenged him to write down eight values with a mean of five hours, a median of four hours, and a range of seven hours. Write down eight possible amounts of TV-watching that have a mean of 5 hours, a median of 4 hours, and a range of 7 hours:
NOTES
Consider…
• What’s typical in the group?
• What value do the observations center/cluster themselves around?
• At which location is the data split into an upper half and a lower half?
Median: The median is the value for which half of the observations in the set are greater and half of the observations are less. The median divides a histogram into two equal areas.
To find the median:
1. Arrange the observations in increasing order.
2. If the number of observations is odd, the median is the middle value.
3. If the number of observations is even, the median is the average of the middle two.
To find the location of the median:
(n + 1) ÷ 2
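The three steps and the (n + 1) ÷ 2 location rule above can be sketched as a short function. The two small data sets are invented purely to show the odd and even cases.

```python
def median_location(n):
    """1-based position of the median among n ordered observations."""
    return (n + 1) / 2

def median(values):
    ordered = sorted(values)                 # step 1: arrange in increasing order
    n = len(ordered)
    if n % 2 == 1:                           # step 2: odd count, take the middle value
        return ordered[int(median_location(n)) - 1]
    lower = ordered[n // 2 - 1]              # step 3: even count, average the middle two
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_example = [7, 2, 9, 4, 5]    # made-up data, n = 5
even_example = [8, 1, 6, 3]      # made-up data, n = 4

print(median_location(5), median(odd_example))    # location 3.0
print(median_location(4), median(even_example))   # location 2.5, between the 2nd and 3rd values
```

A location such as 2.5 is a reminder that the median sits halfway between two observations when n is even.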
Median Example: Twenty-one students guessed the weight of their backpacks, and then weighed
their backpacks. The variable “guess-actual” records the guessed weight minus the actual weight.
Here’s a dot-plot:
1. Find and interpret the median of this
distribution in context.
Mean (x bar): The mean, $\bar{x}$, is the computed average of the set of observations:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
or in sigma notation
$\bar{x} = \frac{1}{n} \sum x_i$
1. Using the data from our day one survey, find the median and mean of “minutes to get to school”.
2. Which measure of center is larger? Why?
[Dot plot: guess_minus_actual, axis from -10 to 8 (DayOneSurvey2009)]
Measures of Spread
1. Range = maximum – minimum
2. Interquartile Range (IQR) = Q3 – Q1. It tells you the width of the central 50% of the distribution.
Quartiles: The first quartile (Q1) is the value that 25% of the observations fall below. It is the MEDIAN OF THE LOWER HALF of the set of observations. The third quartile (Q3) is the value that 75% of the observations fall below. It is the MEDIAN OF THE UPPER HALF of the set of observations.
IQR is typically used to describe spread when the median is used to describe center. On the AP Exam, if you decide to match median and IQR, be sure to explain that your decision is based on the “norm.”
a) Is the IQR a measure of center, spread, or shape? Why?
b) Compute and interpret the IQR for the
“guessed minus actual” data.
c) Quartiles are like fences that divide the
data into four groups of equal size…
Where’s the second quartile, Q2?
Five number summary: min, Q1, median, Q3, max.
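The quartile definitions above (median of the lower half, median of the upper half), together with the 1.5 × IQR outlier rule this packet states next, can be sketched as follows. The data are the older women's pulse rates from the stemplot activity. Leaving the overall median out of both halves when n is odd is an assumption of this sketch; textbooks and software differ on that convention.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + n % 2:]   # skips the middle value when n is odd
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

def outliers(values):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

older_women = [51, 57, 63, 52, 57, 65, 71, 83, 70, 86,
               62, 62, 65, 60, 57, 73, 58, 64, 67, 71]

print(five_number_summary(older_women))
print(outliers(older_women))
```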
Describe the calculator procedure:
Outliers: An observation is called an outlier if it lies more than 1.5 × IQR above Q3 or below Q1.
3. Variance ($s^2$): The variance is roughly the average of the squared differences between each observation and the mean. Also worded: “on average, how much does each observation vary from the mean”. Variance is measured in square units.
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$
or in sigma notation
$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$
4. Standard deviation ($s$): The standard deviation is the square root of the variance. This measurement is in the same units as the original data.
$s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$
Variance ($s^2$) and standard deviation ($s$) are used to measure spread when the mean ($\bar{x}$) is used to describe center.
When the distribution is approximately symmetric, the mean ($\bar{x}$) and standard deviation ($s$) are generally used to summarize the distribution. If the distribution is skewed, a five number summary is generally used.
CHALLENGE
There are twenty players with golf handicaps between 6 and 32
(6,6,8,9,10,11,13,15,15,16,18,19,21,23,26,27,29,30,31,32). The problem is to divide the twenty players into four teams of five players each, with the lowest possible variance and standard deviation between the teams.
Common measures of center | Common measures of spread
Median: a value that separates data into top half, bottom half | Inter-quartile range: Q3 – Q1
Mean: $\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum y_i}{n}$ | Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Midrange = $\frac{\min + \max}{2}$ | Range: max – min
There are many methods to numerically summarize center or spread. Any number computed from a set of data is called a statistic.
The standard deviation: The average distance of an individual value from the mean
But Before we begin … Example: The following histograms display quiz scores on a scale of 1-9 for two different statistics
classes.
a) Between Classes F and G, which set of ratings exhibits more variability (greater spread)?
Explain.
b) One way to measure variability is to compare the “average distance of data from the center of
the distribution.”
Using this idea, order the ratings from least variable to most variable. Justify.
Summation Practice ∑
75, 76, 82, 93, 45, 68, 74, 82, 91, 98
Calculate (remember Order of Operations) and then double-check with the calculator [STAT> CALC> 1 Var Stats]
X values | X² | (X − 2)
∑X =
∑X² =
(∑X)² =
∑(X − 2) =
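To double-check the hand calculations, the four sums can also be sketched in code. The point of the exercise is the order of operations: squaring each value before adding is not the same as adding first and then squaring.

```python
x = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]

sum_x = sum(x)                          # ∑X: add the values
sum_x_squared = sum(v ** 2 for v in x)  # ∑X²: square each value, then add
square_of_sum = sum(x) ** 2             # (∑X)²: add first, then square the total
sum_x_minus_2 = sum(v - 2 for v in x)   # ∑(X − 2): subtract 2 from each value, then add

print(sum_x, sum_x_squared, square_of_sum, sum_x_minus_2)
```

Because 2 is subtracted from each of the ten values, ∑(X − 2) always equals ∑X minus 2 × 10.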
WARM-UP 2005B #1
Remember, each question is worth a total of 5 points.
1. The graph below displays the scores of 32 students on a recent exam. Scores on this exam ranged from 64 to 95 points.
6 | * *
6 | * *
7 | * * *
7 | * * * *
8 | * * * *
8 | * * * * * *
9 | * * * * * * *
9 | * * * *
(a) Describe the shape of this distribution.
(b) In order to motivate her students, the instructor of the class wants to report that, overall, the class’s performance on the exam was high. Which summary statistic, the mean or the median, should the instructor use to report that overall exam performance was high? Explain.
(c) The midrange is defined as (maximum + minimum)÷2. Compute this value. Is the midrange considered a measure of center or a measure of spread? Explain.
The Mean and Median
1. Go to www.whfreeman.com/tps3e : Statistical Applets : Mean and Median
2. Place two observations on the line by clicking below it. Why does only one arrow appear?
3. Place three observations on the line by clicking below it, two close together near the center of the line and one somewhat to the right of these two.
a. Pull the single rightmost observation out to the right. (Place the cursor on the point, hold down the mouse button, and drag the point.) How does the mean behave? How does the median behave? Explain briefly why each measure acts as it does.
b. Now drag the rightmost point to the left as far as you can. What happens to the mean? What happens to the median as you drag this point past the other two? (Watch carefully.)
4. Place five observations on the line by clicking below it.
a. Add one additional observation without changing the median. Where is your new point?
b. Use the applet to convince yourself that when you add yet another observation (there are now seven in all), the median does not change no matter where you put the seventh point. Explain why this must be true.
Example 1: New York Yankee Salaries A histogram of the 2010 salaries for every player on the
New York Yankees is displayed below. The mean salary and median salary are also marked.
Which marking is the median? Which one is the mean? Why?
Example 2: In which of the four histograms below are the mean and the median roughly equal?
In which is the mean considerably greater than the median? In which is the mean considerably
less than the median?
Summarize:
In a skew left distribution, the mean tends to be _________ the median.
In a skew right distribution, the mean tends to be __________ the median.
In a symmetric distribution, the mean tends to be __________ the median.
[Dot plot: salary (thousands), axis from 10 to 70]
[Dot plot: age, axis from 24 to 32]
Example 3: Observe the salaries for all the employees in this small ad agency:
Occupation: Salary (dollars per month):
Owner $60,000
VP $40,000
Senior Agent $22,000
Senior Agent $16,000
Senior Agent $14,000
Senior Agent $12,000
Senior Agent $10,000
Junior Agent $10,000
a) Here’s a dot plot of the distribution of monthly salaries. Without doing calculations, what
would you call a typical, or “average” salary in this company?
b) Compute:
Median of salaries: Mean of salaries:
Midrange of salaries = (min + max) /2:
Which measure of center seems to coincide best with your answer in a)? The mean, median,
or midrange? Why don’t the other values work as well?
c) Suppose that the owner decides to double his own salary, and leave the rest unchanged.
Record the new values for each measure of center. Which measure of center stays the
same (MOST resistant to outliers)? Which one is LEAST resistant to outliers?
d) The Federal government collects 7.65% of each worker’s earnings for Social Security. If
you want to know the total amount collected from the agency for Social security in a month,
which measure of center should be used: mean, the median, or the midrange? Justify.
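The computations in parts b) through d) can be checked with a short script. This is a sketch using Python's standard `statistics` module; `midrange` and `centers` are helper names invented here, and the 7.65% rate comes from part d).

```python
from statistics import mean, median

# Monthly salaries from the table above, owner first
salaries = [60000, 40000, 22000, 16000, 14000, 12000, 10000, 10000]

def midrange(values):
    return (min(values) + max(values)) / 2

def centers(values):
    return {"mean": mean(values), "median": median(values),
            "midrange": midrange(values)}

before = centers(salaries)
after = centers([2 * salaries[0]] + salaries[1:])   # part c): owner's salary doubled

for name in before:
    print(f"{name:9s} before: {before[name]:>9,.0f}   after: {after[name]:>9,.0f}")

# Part d): a total can be recovered from the mean because total = n * mean
rate = 0.0765
total_collected = rate * len(salaries) * mean(salaries)
print(f"Collected in a month: ${total_collected:,.2f}")
```

Comparing the "before" and "after" columns shows which measures move when only the largest salary changes.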
MEASURE | DISADVANTAGES… | BEST TO USE WHEN…
Mean: average value | Sensitivity to the influence of a few extreme observations, NOT a resistant measure | Include every piece of collected data; data set is not highly skewed
Median: middle value | Does not incorporate every piece of data | Data set includes outliers, a resistant measure; looking to describe the “typical” value of a data set
Midpoint: informal middle value | Not very accurate of the entire set of data | A quick idea of the center
Computing Standard Deviation
Example: A researcher recorded the ages of a random sample of five first-year medical students at a local university. The mean of this sample is $\bar{x} = 26$ years.
a) Individual ages vary from the mean age of 26 years. By how much, on average? Estimate.
To answer this question more mathematically, we can compute the standard deviation.
Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Interpretation: roughly, it’s a measure of the average distance of an individual from the mean of
the sample.
This formula is usually a little irritating to calculate by hand, but we’ll do it now to
make sure we understand it. Sum up the squared deviations.
b) To find the mean squared deviation, we’d divide by n, the sample size. In some situations, this works, but for samples, statisticians prefer to divide by n-1. Why? Excellent question. More on this later in the course.
“Average” squared deviation = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ = _______________
This value, symbolized by $s^2$, is called the variance of the ages. It represents (roughly) the
average squared deviation from the mean. The units of this number are “years squared.”
c) To put this back into “Years,” we take the square root of the variance.
Standard deviation = $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ = ______________
d) Which data value above contributed most to the value of the standard deviation? Why?
e) Which measure of spread is more resistant to outliers: the standard deviation or the IQR?
L1 = $x_i$, the data | L2 = $(x_i - \bar{x})$, the deviations | L3 = $(x_i - \bar{x})^2$, the squared deviations
23 | |
24 | |
24 | |
27 | |
32 | |
$\sum (x_i - \bar{x})^2$ =
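The L1 / L2 / L3 columns can be filled in by code as a check on the hand calculation. This sketch follows the same steps as the example: deviations from the mean, squared deviations, division by n − 1, then a square root.

```python
from math import sqrt

ages = [23, 24, 24, 27, 32]                      # L1: the data
n = len(ages)
x_bar = sum(ages) / n                            # sample mean

deviations = [x - x_bar for x in ages]           # L2: x_i minus x bar
squared_deviations = [d ** 2 for d in deviations]  # L3: squared deviations

variance = sum(squared_deviations) / (n - 1)     # divide by n - 1 for a sample
std_dev = sqrt(variance)                         # back to years

print("  L1     L2     L3")
for x, d, sq in zip(ages, deviations, squared_deviations):
    print(f"{x:4d} {d:6.1f} {sq:6.1f}")
print("sum of squared deviations:", sum(squared_deviations))
print("variance (years squared):", variance)
print("standard deviation (years):", round(std_dev, 3))
```

The L2 column always sums to zero, which is why the deviations are squared before averaging.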
More questions to help understand the standard deviation (The middle school challenge –
but for measures of spread):
a) You have twelve quiz scores. Each score is an integer from 0 to 10. You are told that the
standard deviation of the scores equals zero. What is one possibility for your set of data? Are
there other possibilities? What MUST be true about the set of data you have?
b) Suppose the IQR for a set of twelve quiz scores = 0. Do all the scores have to be the same?
Justify.
c) Earlier, you saw how the std. deviation of a sample could be zero. Take four observations from
0 to 10. Make their standard deviation as LARGE as possible.
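One way to test a guess for part c) is a brute-force search. The sketch below assumes the four observations must be whole numbers from 0 to 10 (as with the quiz scores in part a); the packet does not say this explicitly) and uses the sample standard deviation with n − 1.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every way to choose four whole-number observations from 0 to 10 (order ignored)
candidates = list(combinations_with_replacement(range(11), 4))

# Keep the set of four values with the largest sample standard deviation
best = max(candidates, key=stdev)

print(len(candidates), "candidate data sets checked")
print("largest standard deviation:", best, round(stdev(best), 3))
```

Try it only after making your own prediction, then ask why the winning data set looks the way it does.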
Example: In Mrs. Bain’s running group (about 10 folks), the mean 5K time for all runners on their last run is about 25 minutes, with a standard deviation of about s = 3.6 minutes.
a) Mrs. Bain was the slowest person in her running group this weekend. She ran the 5k in 35
minutes. If she were removed from the data set, how would the STANDARD DEVIATION
change: (go up, go down, stay the same?) Justify.
b) How would the IQR change? Justify.
Summarizing numerical summaries: We’ve learned many different ways to summarize quantitative data:
mean $\bar{x}$     median $M$
IQR = $Q_3 - Q_1$     std. deviation $s$
RANGE = max - min     MIDRANGE = (max + min) ÷ 2
 | MOST resistant to skewness / outliers | LEAST resistant to skewness / outliers
Measure of center | |
Measure of spread | |
Always plot your data before using numerical summaries.
When the data look roughly unimodal and symmetric, it’s fine to compare means and compare
standard deviations.
If you see skewness or outliers in the data, you should compare medians and compare IQRs.
Summarizing summary Statistics, part 2
Statistics are used as numerical indicators of:
Center - the “middle,” or “average” of a distribution
Spread – how widely dispersed are your data?
Location in a distribution - Where among the sample does a particular value fall? (Note: measures
of center are also measures of location)
Shape - can indicate the presence of symmetry / skewness in a distribution.
Exercise: For each formula, determine / justify if it’s a measure of center, spread, location, or
shape (specifically, symmetry vs. skewness).
a. $\dfrac{Q_1 + Q_3}{2}$
b. 95th percentile - 5th percentile
c. mean - median
d. First ignore the highest two and lowest two observations in the sample. Then from the
rest, compute (sum of remaining obs) ÷ (number of remaining obs)
e. 75th percentile
f. $\dfrac{\sum_{i=1}^{n} |x_i - \text{median}|}{n}$
Describing a distribution of a quantitative variable. Identify the overall center, or middle, of the distribution. Put your comments in context of the data.
Numerical Measures of center: mean, median, midrange
Identify how spread out the data are from the center. Put your comments in context of the data.
Numerical Measures of spread: standard deviation, IQR, range
Identify the shape of the distribution, and place your comments in context of the data.
Ways to describe: Unimodal, bimodal, or uniform? If unimodal, symmetric or skewed?
Identify any deviations / outliers from the big picture. Talk about them in context of the data.
Speculate/ explain why they are outliers.
Things to remember:
When comparing two groups, be sure you make actual comparisons, and not simply
“parallel” descriptions of each distribution. In other words, which group has a higher center?
Which group shows more variability?
When comparing centers or spreads, be sure you use the same measurement on each
group. In other words, don’t compare the median of group A with the mean of group B.
Compare the same measure of center in each group. Same goes with spreads. In other
words, make sure you compare “apples to apples, not apples to oranges.”
Always make comparisons in context of the data.
Box Plots or “Box-and-Whisker Plots”
Find each of the following for the distribution of Cost of Last Hair Cut (page 11):
Q1 =
Q3 =
IQR =
Five Number Summary =
a) Are there any outliers in the distribution of Cost of Last Hair Cut?
b) Complete the table to find variance and standard deviation:
Participant # | $x$ | $(x - \bar{x})$ | $(x - \bar{x})^2$
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
∑
$\sum (x_i - \bar{x})^2$ =
c) Which individual contributed the most (aka which data points are furthest from
the mean)?
Calculate
Variance = $s^2 = \dfrac{1}{n - 1} \sum (x_i - \bar{x})^2$ = ____________________
Standard Deviation = $s = \sqrt{s^2}$ = _________________________
d) Which would be more appropriate in describing the distribution of the Cost of Last Haircut: a five number summary or the mean and the median? Why?
e) What are the units of measurement for each:
Variance: ___________________
Standard Dev: _______________
f) Construct a boxplot for the Cost of Last Haircut using the grid as a guide:
g) Construct parallel boxplots for the Cost of Last Haircut for men and women using the grid as a guide:
Using the boxplots above, compare and contrast the distributions of the Cost of
Last Haircut for men and women (be sure to use descriptive language and support your responses with data):
Effective communication in comparing distributions:
When comparing the distributions of two groups, clear communication is key. Here are some examples of how to improve poor communication in a problem. Let’s examine the following responses from a problem about tomatoes!
1. INCLUDING SUFFICIENT CONTEXT.
WEAK: “Group C had a median of about 31, and group E’s median is higher, around 35.”
Issue: Our job is to describe the group of individuals in the study. Describing graphs without context doesn’t tell the reader what the numbers mean in this setting.
BETTER: “The control group showed an average increase of 31 inches, but the experimental group’s average increase was larger, about 35 inches.”
2. LACK OF COMPARISON / PARALLEL descriptions without comparison.
WEAK: “The control group showed an average increase of 31.2 inches, while the experimental group’s average was 31.7 inches.”
Issue: Is this a major difference or not? Are you claiming they’re pretty similar, or different? It’s unclear. The “while” doesn’t communicate your conclusion thoroughly.
BETTER: “The control group showed an average increase of 31.2 inches, but the experimental group’s average was slightly higher (31.7 inches).”
3. NOT COMPARING SIMILAR QUANTITIES to compare centers or spreads.
WEAK: “The control group showed less variability than the experimental group. The range of increases spanned 12 inches for the control group, but the standard deviation in the experimental group was 6.”
Issue: You are using different ways to measure spread in each group. An effective comparison of spreads uses the same measurement for both groups.
BETTER: “The control group showed less variability than the experimental group. The range of increases spanned 12 inches for the control group, but the range of increases for the experimental group was 16 inches.”
4. VAGUE / UNCLEAR terminology.
WEAK: “The control group’s increases are well-distributed evenly, but the experimental group is skewed.”
Issue: “Well distributed” might mean different things to different readers. For shapes, learn the terminology for each kind of shape.
BETTER: “The control group’s increases are symmetric, but the experimental group is skewed to the left.”
2002 Form B #5
At a school field day, 50 students and 50 faculty members each completed an
obstacle course. Descriptive statistics for the completion times (in minutes) for the two groups are shown below.
Students Faculty Members
Mean 9.90 12.09
Median 9.25 11.00
Minimum 3.75 4.50
Maximum 16.50 25.00
Lower quartile 6.75 8.75
Upper quartile 13.75 15.75
(a) Use the same scale to draw boxplots for the completion times for students and for faculty members.
(b) Write a few sentences comparing the variability of the two distributions.
(c) You have been asked to report on this event for the school newspaper. Write a
few sentences describing student and faculty performances in this competition for
the paper.
Linear Transformations
A linear transformation changes every value of the variable x into a new value xnew given by the equation: xnew = a + bx.
Original Data (x) | Median | Mean | Range | IQR | St Dev | Variance
3, 4, 6, 8, 12, 15, 20 | | | | | |
Add 4 to each value in the original data and complete the table:
xnew = 4 + x | Median | Mean | Range | IQR | St Dev | Variance
7, 8, 10, 12, 16, 19, 24 | | | | | |
What changed? What didn’t? Why (both mathematically and conceptually)?
Multiply each value in the original data by 3 and complete the table:
xnew = 3x | Median | Mean | Range | IQR | St Dev | Variance
9, 12, 18, 24, 36, 45, 60 | | | | | |
What changed? What didn’t? Why (both mathematically and conceptually)?
Multiply each value in the original data by 2 and add 3 and complete the table:
xnew = 3 + 2x | Median | Mean | Range | IQR | St Dev | Variance
9, 11, 15, 19, 27, 33, 43 | | | | | |
What changed? What didn’t? Why (both mathematically and conceptually)?
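To check your three tables, here is a short Python sketch that computes every summary for the original data and for xnew = 3 + 2x. The helper names (`quartiles`, `summarize`) are made up for this example, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data, so other software may report slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value left out when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[:half]), statistics.median(data[-half:])

def summarize(values):
    """Return the six summaries asked for in the tables above."""
    q1, q3 = quartiles(values)
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),           # sample SD (divides by n - 1)
        "variance": statistics.variance(values),  # sample variance
    }

x = [3, 4, 6, 8, 12, 15, 20]
a, b = 3, 2                                       # xnew = a + b*x
x_new = [a + b * v for v in x]

old, new = summarize(x), summarize(x_new)
for name in old:
    print(f"{name:9s} original = {old[name]:8.3f}   transformed = {new[name]:8.3f}")
```

Change `a` and `b` to 4 and 1, or 0 and 3, to check the first two tables the same way.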
How is each summary statistic of x affected by the linear transformation xnew = a
+ bx?
Mediannew = IQRnew =
Meannew = St Devnew =
Rangenew = Variancenew =
Suppose a teacher gave a test for which x bar = 70 and s = 21. He wants to apply a linear transformation xnew = a + bx to “scale” the grades so that x barnew = 82
and snew = 7. Find a and b.
Which students are happy with the grade adjustment? Who is mad? Who wasn’t
affected?
2004 #1 1. A consumer advocate conducted a test of two popular gasoline additives, A and B.
There are claims that the use of either of these additives will increase gasoline mileage in cars. A random sample of 30 cars was selected. Each car was filled with gasoline and the
cars were run under the same driving conditions until the gas tanks were empty. The distance traveled was recorded for each car.
Additive A was randomly assigned to 15 of the cars and additive B was randomly assigned to the other 15 cars. The gas tank of each car was filled with gasoline and the assigned
additive. The cars were again run under the same driving conditions until the tanks were empty. The distance traveled was recorded and the difference in the distance with the additive minus the distance without the additive for each car was calculated.
The following table summarizes the calculated differences. Note that negative values
indicate less distance was traveled with the additive than without the additive.
Additive | Values Below Q1 | Q1 | Median | Q3 | Values Above Q3
A | -10, -8, -2 | 1 | 3 | 4 | 5, 7, 9
B | -5, -3, -3 | -2 | 1 | 25 | 35, 37, 40
(a) On the grid below, display boxplots (showing outliers, if any) of the differences of the two additives.
(b) Two ways that the effectiveness of a gasoline additive can be evaluated are by looking at either
- the proportion of cars that have increased gas mileage when the additive is used in those cars, or
- the mean increase in gas mileage when the additive is used in those cars.
i. Which additive, A or B, would you recommend if the goal is to increase gas mileage in the highest proportion of cars? Explain your choice.
ii. Which additive, A or B, would you recommend if the goal is to have the highest mean increase in gas mileage? Explain your choice.
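For part (a), one way to decide which points to plot as outliers is the 1.5 × IQR rule. The sketch below applies that rule to the values listed in the table for each additive. The function names are hypothetical, and only the values below Q1 and above Q3 are tested because nothing between the quartiles can fall outside the fences.

```python
def outlier_fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs below Q1 and 1.5 IQRs above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Values below Q1 and above Q3, copied from the table in the problem.
tails_a = [-10, -8, -2, 5, 7, 9]
tails_b = [-5, -3, -3, 35, 37, 40]

out_a = flag_outliers(tails_a, q1=1, q3=4)
out_b = flag_outliers(tails_b, q1=-2, q3=25)

print("Additive A fences:", outlier_fences(1, 4), "outliers:", out_a)
print("Additive B fences:", outlier_fences(-2, 25), "outliers:", out_b)
```

Use the output to decide where each whisker should stop before you draw the boxplots.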
Use Standardization to compare LOCATIONS
Suppose that UC Berkeley’s admissions office needs to compare scores of students
who take the SAT with those who take the ACT. Suppose that among the college’s applicants who take the SAT, scores have a mean of 2020 and a standard
deviation of 174. Further suppose that among the college’s applicants who take the ACT, scores have a mean of 29 and a standard deviation of 5.2.
(a) If applicant Bobby scored 2115 on the SAT, how many points above the SAT
mean did he score?
(b) If applicant Adeline scored 32 on the ACT, how many points above the mean
did she score?
(c) Is it sensible to conclude that since your answer to (a) is greater than your
answer to (b), Bobby outperformed Adeline on the admissions test? Explain.
(d) Determine how many standard deviations above the mean Bobby scored. To
do this: divide your answer to (a) by the standard deviation of the SAT scores. Then do the same for Adeline.
Once you have a sample’s mean and standard deviation, you can generalize the behavior of individual data values. Also, you can make comparisons of individual values from different distributions. One calculates a z-score, or standardized
score by subtracting the mean from the value of interest and then dividing by the standard deviation. These z-scores indicate how many standard deviations above
(or below) the mean a particular value falls. One should use z-scores only when working with mound-shaped distributions that are roughly symmetric.
(f) Which applicant has the higher z-score for his or her test score? In your own
words, explain who performed better.
(g) The mean SAT score in 1998 was 1013 (there were only two parts) with a standard deviation of 100 points. Mrs. Bain’s friend (Rowdy) scored 885. How did
they compare to Bobby and Adeline?
[Figure: “Collection 1” dot plot of heights for the college and four-year-old groups; horizontal axis from 0 to 220]
(h) Under what conditions does a z-score turn out to be negative?
Standardized Values and Z-Scores
If x is an observation from a distribution that has a known mean and standard
deviation, the standardized value of x is:
z = (x – x bar) ÷ standard deviation
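Here is a small Python sketch of that formula, using the SAT and ACT numbers from the admissions example above. The function name `z_score` is just an illustrative choice.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

bobby = z_score(2115, mean=2020, sd=174)    # SAT applicant
adeline = z_score(32, mean=29, sd=5.2)      # ACT applicant
rowdy = z_score(885, mean=1013, sd=100)     # 1998 SAT

print("Bobby:", round(bobby, 3))
print("Adeline:", round(adeline, 3))
print("Rowdy:", round(rowdy, 3))
```

Because each z-score is measured in standard deviations of its own test, the three results can be compared directly even though the raw scores use different scales.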
STANDARDIZATION
Scenario: Below are the distributions of heights (in centimeters) in a population of male high-
school students, and the heights in a population of male 4-year olds:
Mrs. Bain’s cousin Kyle is going to college. He’s very tall: 191 cm tall. Mrs. Bain also has a 4-year-old neighbor, named Joey. He is also very tall for his age: 112 cm. Mrs. Bain wants to know
which one is taller, relative to their group of peers.
a) By looking at the displays of heights in each population, which one appears to be taller,
relative to their peer group? Provide evidence.
b) How many cm above the mean height for college guys is Kyle?
c) How many cm above the mean height for 4-year-olds is Joey?
d) Which group, college students or 4-year olds, shows more variability in their heights?
Evidence?
e) How many standard deviations above the mean height are Kyle and Joey? Show the
computations. Based on this measurement, which one is taller relative to his peers?
Height(cm) Mean Std. Dev’n
college 180 10
4-year 100 6
f) Consider the set of heights for this population of college men. The mean of all heights is 180
cm. Suppose that we subtract “180” from each height in the population. What new name could
we give to this transformed variable?
g) Now, suppose that after subtracting, we now divide each value by 10, the standard deviation.
How does the standard deviation change?
h) If we converted each college height into a z-score, what are the mean and standard deviation
of the z-scores? Shape?
Mean:________ SD:_______________ shape:__________
i) Take a guess: what happens when we convert the 4-year old heights into z-scores: What is the
new mean and SD? Shape?
Mean:_________ SD:_______________ shape:_________
j) How does standardizing change the (a) shape, (b) center, and (c) spread of a distribution?
k) Explain the purpose of converting two groups of data into z-scores.
Density Curves and Normal Distributions
Suppose we looked at an exam given to a large population of students. The histogram of this data appears like the graph to the left below. However, rather than show how many data values are in each bin, we now look at the bin height in terms of percentage. This histogram is symmetric and the tails fall off from a single peak in the center (unimodal). There are no obvious outliers.
We now draw a smooth curve through the tops of the bins to get a better picture of the overall pattern of the data (right graph). This curve is an approximate mathematical model for the distribution. We will find that it is easier to work with this model than the actual histogram. We call this curve a density curve.
Since the height of the histogram is in terms of percent, it should be obvious that the total area under the histogram should be exactly 1 (as 100% of the scores are shown). In other words, if you added up the heights of all the bars, it would equal 100%.
In the graph to the right, a density curve is
shown. It appears to have data values from 0
to 10 and is skewed right. The shaded area
under this density curve between 2 and 4 is .3109.
That is saying that 31.09% of the data lies
between 2 and 4.
DENSITY CURVE NEED TO KNOW
1. Any curve above the x-axis that has a total area equal to 1 is considered a Density
Curve.
2. The MEDIAN of a density curve is called the “equal areas point”. The MEAN is called
the “balance point”. Why do you think they got those nicknames?
3. A Density Curve is used to describe the location of individual observations within a
distribution
4. Notation: because a Density Curve usually includes all the data of a population
(versus just a sample of the population), use Greek letters to denote the mean (µ: mu) and standard deviation (σ: sigma)
Good trivia night fact: English letters are used for statistics (x bar, s, $s^2$ – measures of a
sample) and Greek letters are used for parameters (µ, σ, etc. – measures of the
population)
5. SKEWNESS in a density curve affects the mean and median the same as with a
histogram:
Right skewed: mean ___ median
Left skewed: mean ___ median
Symmetric curve: mean ___ median
A measure of LOCATION: Percentiles
Let’s stick with College Entrance Exams. If you’ve
taken a close look at your SAT or PSAT scores, then
you may have noticed your percentile score. For
example, the 98th percentile is a score that separates
the bottom 98% of the cases from the top 2%.
The table to the right shows approximate percentiles
for scores on each section of the SAT in 2006. These
numbers describe the group of college-bound senior
males who graduated in the 2006-2007 school year.
How to read: A male college-bound student scoring
600 on the critical reading section scored higher than about
78% of his peers. Correspondingly, about 22% of his peers scored higher than him.
a) Kiko’s writing score is higher than 90% of other
males like himself. Estimate his score. Explain your reasoning.
b) On which test are males’ scores the highest, on average? Provide specific evidence for
your decision.
c) Construct box plots comparing the scores of boys on the math and the writing section of the
SAT. Compare the distributions in context.
Density curves shaped like a bell are EVERYWHERE!
You may have noticed that the visual displays of some real world variables follow very similar patterns. This pattern is called a bell-curve or a Normal Distribution:
Grades for my regular statistics
class:
Or this histogram of SAT Writing exam scores for 2006:
1. With your group, come up with 3 additional types of
data sets that you would expect to be normally distributed.
2. With your group, come up with 3 types of data sets that you would expect to NOT be normally
distributed.
This shape tends to crop up frequently when looking at
data distributions. It is also referred to as a “bell
curve”. Under the right circumstances, we can use
Normal density curves to “idealize” the shape of
variables like those above. We capitalize Normal to
remind you that these curves are special. The more
“normal” a distribution is, the closer these relationships
come to being perfectly true. No distribution is
perfectly normal and therefore, these relationships are
approximations in most real-life settings.
Just like in any density curve, the area under the curve
between two points “a” and “b” corresponds to the proportion of the data that
falls between “a” and “b.”
What’s another word for proportion? ____________________
Recall: what does the entire area under any density curve equal? ______________
NOTATION alert: On a Normal density curve, we use GREEK letters to designate the mean and the standard deviation ~ because we are talking about a population.
Mean of normal model for a variable X = $\mu_X$
Std. deviation of X: $\sigma_X$
But with a data distribution, we use ROMAN letters (i.e. ENGLISH letters) to designate the mean and the standard deviation of a set of data:
Mean of data set for a variable X = $\bar{x}$
Std. deviation of X: $s_x$
For the graph at right, what is the
mean and standard deviation of this Normal Distribution?
Mean =
Standard Dev =
Does this curve represent an entire
population or just a sample? How can you tell?
For the graph at left, number the curves from widest to narrowest spread (1 = widest).
The 68 – 95 – 99.7% Rule: A property of ANY Normal curve
A little something for our Academy of Finance kiddos…
The annual increase in value for each stock on the S&P 500 varies from stock to stock. Suppose
that in the year 2005, one economist (N. O. Wei) projected that the increase for all stocks
(measured in percentage points) will be normally distributed, with a mean of 5%, and a standard
deviation of 8%.
a) Sketch a normal density curve that can model the distribution of annual increases in
2005. Mark the values that correspond to the mean increase, and the increases that are
1, 2, and 3 SD away from the mean.
According to this economist’s predictions [sketch a normal density curve for each
justification]….
b) The top 2.5% of all stocks will show an increase of at least how much?
c) How many stocks will show a loss of at least 3% last year?
d) What percent of all stocks will show an increase from 13% to 21%?
e) Would it be likely to find a stock that gains more than 29% in a year?
f) We would expect the top 2.5% of all stocks to gain how much in a year?
g) Suppose we want to know what percentage of stocks would gain in value. Shade in the
region we’re looking for.
Understanding the STANDARD NORMAL, or Z-Model
Recall the purpose of standardized scores, or calculating z-scores. By
standardizing, we can rescale any set of data so its mean equals 0 and its standard deviation
equals 1. When a histogram of real-world data is unimodal and symmetric, then normal
curves often are used to describe the distribution shown in the histogram.
Facts about the normal model | How this applies to data that is approximately normal in its shape
The mean equals 0, standard deviation is 1: $\mu = 0$, $\sigma = 1$ | When you convert original data values to z-scores, the mean z-score will be 0, and the standard deviation will be 1: $\bar{z} = 0$, $s_z = 1$
About 68% of the area under the curve falls between -1 and 1. | About 68% of the data will be within one standard deviation of the mean.
About 95% of the area under the curve falls between -2 and 2. | About 95% of the data will fall within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls between -3 and 3. | Nearly all (99.7%) of the data will fall within 3 standard deviations of the mean.
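The three percentages in the table can be checked against the standard Normal curve itself. This sketch uses `NormalDist` from Python’s built-in `statistics` module (Python 3.8 or later); the dictionary name `areas` is just for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)          # the standard Normal (z) model

areas = {}
for k in (1, 2, 3):
    # Area under the curve between -k and +k standard deviations.
    areas[k] = z.cdf(k) - z.cdf(-k)
    print(f"between -{k} and {k}: {areas[k]:.4f}")
```

The rule’s 68, 95, and 99.7 are rounded versions of these areas, which is why answers from the rule and answers from the calculator can differ slightly.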
Example 1: 4 runners run the New York marathon. We wish to determine which runner had a faster
time relative to his age group. Assume normality. Complete the chart.
Best time ______ 2nd best time______ 3rd best time _______ Worst time ________
Example 2: Suppose you have a sample of data that can be modeled well with a normal curve.
Use the 68-95-99.7 rule to answer the following:
a) About what percent of the sample has a positive z-score? Show with a sketch of the normal
density curve.
b) About what percent of the sample has z-scores greater than 1? Show with a sketch of the
normal density curve.
c) About what percent are more than 2 standard deviations below the mean? Show with a
sketch of the normal density curve.
d) Suppose that heights for all 17 year old males can be modeled well with a normal curve.
Suppose we know that the US Navy will reject any young male whose height is more than 2
SD below or 1 SD above the mean height. What percent of young men are eligible for
admission to the Navy? (Note: this is a fake example. The situation I described is not true.)
e) Approximately what percent of the sample has z-scores greater than 1.5?
Using the TI-84 for normal calculations
Suppose you want to find any area between a and b under a standard normal curve.
TI COMMAND: 2nd DISTR normalcdf(a, b). Output = Area.
1) Define variable of interest, x = ?
2) Check if normal model appropriate.
3) Make a picture. Point out the GOAL in picture!
4) State goal, and re-phrase the goal in terms of z scores. Use inequality signs where appropriate.
5) Use the TI-84 for calculation.
6) Answer question in context.
Men's heights are approximately normally distributed with a mean of 70 inches and a standard deviation of 2.6 inches.
Example 1: About what proportion of men are between 67 and 76 inches? Set up (pg. 142 in your book):
1) Define variable of interest. Let x = height.
2) Check if normal model appropriate. We’re told men’s heights are approximately normal. We can use the normal model.
3) Make a picture. Point out the GOAL in picture!
4) State goal, and re-phrase the goal in terms of z scores. If 67 < heights < 76, then z-scores -1.154 < z < 2.307.
5) Use the TI to do the calculation. Enter 2nd DISTR normalcdf(-1.154, 2.307) ENTER.
6) Answer question in context. Area = .865. 86.5% of all men are between 67 and 76 inches.
Example 2: What proportion of men are less than 6 feet tall?
1. Let X = height.
2. We already know a normal model is appropriate. Goal-find the proportion of men under 72 inches.
3. (picture)
4. If height < 72, then z < (72-70)÷2.6, so z < .7692
5. To find the area on TI: 2nd DISTR normalcdf(-9999, .7692) ENTER = .7791
NOTE: Since the area we want has no lower limit, we substitute a large negative
number like "-9999" for negative infinity.
6. Conclusion: about 77.9% of all men are under 6 feet tall.
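If you don’t have your calculator handy, Examples 1 and 2 can be reproduced with a few lines of Python. This is only a sketch: the helper `normalcdf` is a made-up name chosen to mirror the TI command, built on `NormalDist` from Python’s `statistics` module, and it assumes the Normal model for men’s heights stated above.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under a Normal curve between lower and upper (mirrors the TI command)."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Example 1: between 67 and 76 inches, first with z-scores, then with raw heights.
area_z = normalcdf(-1.154, 2.307)
area_x = normalcdf(67, 76, mean=70, sd=2.6)

# Example 2: under 72 inches, using -9999 in place of negative infinity.
area_under_72 = normalcdf(-9999, 0.7692)

print(round(area_z, 3), round(area_x, 3), round(area_under_72, 4))
```

Working with z-scores and working with the raw heights give the same area, up to rounding in the z-scores.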
Example 3: Find the proportion of men who are taller than 64 inches. Set up the problem, draw the curve, show the steps, and use the TI-84 to find the proportion.
1) Define variable of interest, x = ?
2) Check if normal model appropriate.
3) Make a picture. Point out the GOAL in picture!
4) State goal, and re-phrase the goal in terms of z scores. Use inequality signs where appropriate.
5) Use the TI-84 for calculation.
6) Answer question in context.
Answer: About 1.05% of the area lies under 64 inches, so about 98.95% of men are taller than 64 inches.
Flip it and reverse it! Suppose that you know the area under a normal curve to the left of some location
“k”, and you want to find the corresponding z score:
1) How is this different from what we were doing above?
COMMAND: 2nd DISTR invNorm(area to the left of k). The output will be the z-score we
seek.
Example #4: If we assume the normal model, an observation at the 75th percentile has what z-score? Draw a picture indicating what we know, and what we want.
1) Define variable of interest.
2) Check if normal model appropriate.
3) Make a picture. Point out the GOAL in picture!
4) State goal, and re-phrase the goal in terms of z scores.
5) Use the TI-84 for calculation.
6) Answer question in context.
Example #5: The lowest 15% of all men are all under what height?
1) Define variable of interest.
2) Check if normal model appropriate.
3) Make a picture. Point out the GOAL in picture!
4) State goal, and re-phrase the goal in terms of z scores.
5) Use the TI-84 for calculation.
6) Answer question in context.
Example #6: Only 0.35% of the population of adult males are taller than Mr.
Bain. According to the normal model, how tall is Mr. Bain?
Answer: Mr. Bain is about 77 inches tall, or 6’ 5” .
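The answer to Example #6 can be checked by going backwards from an area to a z-score and then un-standardizing. In this sketch the helper `inv_norm` is a hypothetical stand-in for the TI invNorm command, built on `NormalDist.inv_cdf` from Python’s `statistics` module, and it assumes the same Normal model for men’s heights (mean 70 inches, SD 2.6 inches).

```python
from statistics import NormalDist

def inv_norm(area_left):
    """z-score with the given area to its left under the standard Normal curve."""
    return NormalDist(0, 1).inv_cdf(area_left)

mean_height, sd_height = 70, 2.6       # model for men's heights, in inches

# Only 0.35% are taller, so 99.65% of the area is to the LEFT of Mr. Bain.
area_left = 1 - 0.0035
z = inv_norm(area_left)

# Un-standardize: height = mean + z * SD.
height = mean_height + z * sd_height

print("z-score:", round(z, 3))
print("height in inches:", round(height, 1))
```

The step that is easy to miss is converting “taller than” into an area to the left, since that is what the inverse command expects.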
Example #7: Joan found that her average telephone call last month was 9.6 minutes with a standard
deviation of 2.4 minutes. Her telephone call lengths were roughly normal. What percentage of her calls…
a. were less than 5 minutes? b. were less than 10 minutes?
c. were more than 15 minutes? d. were between 4 and 8 minutes?
Find the telephone call length that represents:
e. the 90th percentile f. the 98th percentile
g. the top 0.5 percent h. the bottom 1%
Example 8: 1) To the right is a density curve for distribution of values between 0 and 1.3. Use
geometry to determine the answers to these questions.
a. What percentage of the data lies below .6?
b. What percentage of the data lies above .5?
c. What percentage of the data lies between .3 and .9?
d. What percentage of the data is exactly .8?
e. What is the median of the data?
f. Draw the approximate position of the mean on the drawing.
More to think about:
a) In a large set of normally distributed data, Q1 and Q3 correspond to which z-scores? (Hint, if
you get one of them, then you should be able to figure out the other one easily!)
b) We have declared a value an “outlier” if it falls more than 1.5 IQR’s below Q1 or 1.5 IQR’s above
Q3. In a large set of normally distributed data, what proportion of values would be declared
outliers by this criterion?
c) Suppose that in a population of professional basketball players in the USA, Mr. Bain only ranks in
the 45th percentile. If the distribution of basketball player heights is also normally distributed, and
has the same SD as the population of all men (2.6 inches), then what is the mean height for this
population of players?
d) Suppose that among a population of Olympic-level swimmers, 10% can swim 100 meters in
under 49 seconds, while 6% require 55 seconds or more to swim 100 meters. If it’s safe to assume
a normal model for the distribution for times, what are the mean and standard deviation of the
distribution of times?
Assessing Normality: How to assess whether a normal model is appropriate for data
If the problem set doesn’t explicitly state that the distribution is normal, it is necessary for you to
check for normality. For more information, see pages 148-154 in your textbook.
Method 1: Plot the data (this is not enough to prove Normality, you have to confirm with another
method). Construct a histogram or a stemplot. See if the graph is approximately…
unimodal, bell-shaped and symmetric about the mean
Method 2: Construct a Normal probability plot or Normal quantile plot. This type of plot
displays how much each data value varies from what would be expected if the data were Normal.
a. This can be found on the TI-84.
b. This is a scatter plot of each data value versus its predicted z-score based on its order in
the data set.
c. If the distribution is roughly Normal, then the plot should look like a diagonal straight
line.
By hand
1. Arrange the observed data values from smallest to largest. Record the percentile of the
data each value occupies.
2. Use the standard Normal distribution to find the z-scores at these percentiles.
3. Plot each data point x against the corresponding z. If the data distribution is close to
Normal, then the plotted points will lie close to some straight line.
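The three by-hand steps can be sketched in Python as follows. Two things here are assumptions and not part of the packet: the sample heights are made-up numbers, and the percentile for the i-th smallest of n values is taken to be (i - 0.5) / n, which is one common convention (your textbook or calculator may use a slightly different one).

```python
from statistics import NormalDist

def normal_scores(values):
    """Pair each sorted data value with the z-score expected at its percentile."""
    data = sorted(values)                      # step 1: smallest to largest
    n = len(data)
    std_normal = NormalDist()                  # standard Normal: mean 0, SD 1
    pairs = []
    for i, x in enumerate(data, start=1):
        percentile = (i - 0.5) / n             # assumed percentile convention
        z = std_normal.inv_cdf(percentile)     # step 2: z-score at that percentile
        pairs.append((x, z))                   # step 3: the point (x, z) to plot
    return pairs

sample = [61, 64, 66, 67, 68, 69, 70, 71, 73, 76]   # hypothetical heights, in inches
pairs = normal_scores(sample)
for x, z in pairs:
    print(f"x = {x:3d}   expected z = {z:6.3f}")
```

Plot the printed pairs on graph paper; if the points fall close to a straight line, a Normal model is reasonable for the data.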
Method 3: Does the data follow the 68-95-99.7% rule? Check each standard deviation.
a. Compute the mean and SD of the data
b. Then see if about 68% of the data are within one SD of the mean, 95% are within 2 SD
of the mean, and nearly all (99.7%) are within 3 SD of the mean.
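Method 3 is easy to automate. The sketch below counts what fraction of a data set falls within 1, 2, and 3 standard deviations of the mean. The function name and the sample values are made up for illustration, and with a small sample the fractions can only roughly match 68%, 95%, and 99.7%.

```python
import statistics

def empirical_rule_check(values):
    """Fraction of the data within 1, 2, and 3 SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)              # sample SD (divides by n - 1)
    result = {}
    for k in (1, 2, 3):
        within = sum(1 for x in values if abs(x - mean) <= k * sd)
        result[k] = within / len(values)
    return result

sample = [61, 64, 66, 67, 68, 69, 70, 71, 73, 76]   # hypothetical heights, in inches
result = empirical_rule_check(sample)
for k, fraction in result.items():
    print(f"within {k} SD: {fraction:.0%}")
```

Compare the printed percentages with 68, 95, and 99.7, and remember to confirm with a graph as well.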