Introduction to Applied Statistics Xiaobo Sheng
Overview
CH 1: Introduction
CH 2-3: Concepts, descriptive statistics of one variable
CH 6-8: Probability, a few common probability distributions and models
CH 9-13: Statistical inference
CH 15: Linear regression
Slide 3
Introduction
What is statistics?
A collection of numerical information,
or the branch of mathematics dealing with the theory and techniques of
collecting, organizing, and interpreting numerical information.
(We will focus on the first definition.)
Slide 4
Why do we need statistics? Examples: Pepsi vs. Coke, horse racing,
casino games.
Slide 5
How do we deal with statistics?
Input: data set (a collection of information)
Process: data analysis (making sense of a data set)
Output: statistical inference (drawing conclusions about a population
based on a sample from that population)
Example data set:
Barney  Ted  Lily  Robin  Marshall
A-      B+   A     C      F
Slide 6
A few basic definitions to know
Population: the group or collection of interest to us. It is usually
very large and messy.
Sample: a subset of the population, reasonably small and capable of
being analyzed using statistical tools. We use the observations in the
sample to learn about the population.
Examples: income of teachers, average age, etc.
Slide 7
Descriptive statistic: a number used to summarize information in a set
of data values. Which statistic is appropriate varies from problem to
problem.
Variable: a particular piece of information. Two types:
Quantitative variable: has numerical values that are measurements.
Categorical variable: has values that cannot be interpreted as numbers.
Slide 8
Slide 9
1st quartile (25th percentile): at least three-fourths of the values
are greater than or equal to the first quartile.
3rd quartile (75th percentile): at least three-fourths of the values
are less than or equal to the third quartile.
Page 49
Slide 10
Range: the difference between the largest and smallest values of a
data set.
Interquartile range: the difference between the 3rd and 1st quartiles.
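The range and interquartile range can be sketched in a few lines of Python. The data values below are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: textbooks and software use several slightly different rules, so small data sets may give slightly different quartiles elsewhere.

```python
import statistics

# Hypothetical data values, for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# statistics.quantiles with n=4 returns the three quartile cut points.
# method="inclusive" is one of several common conventions (an assumption here).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: 3rd quartile minus 1st quartile.
iqr = q3 - q1

print("range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```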
Slide 11
Standard deviation: used to measure the variation of values about the
mean.
σ: population standard deviation
s: sample standard deviation
P82
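A minimal sketch of the two standard deviations, assuming the usual textbook definitions (the population version divides the sum of squared deviations by N, the sample version by n - 1). The data values are hypothetical.

```python
import math
import statistics

# Hypothetical data values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations about the mean.
ss = sum((x - mean) ** 2 for x in data)

sigma = math.sqrt(ss / n)      # population standard deviation (divide by N)
s = math.sqrt(ss / (n - 1))    # sample standard deviation (divide by n - 1)

# The standard library computes the same two quantities.
assert math.isclose(sigma, statistics.pstdev(data))
assert math.isclose(s, statistics.stdev(data))

print("sigma =", sigma, " s =", s)
```

The sample version is always a little larger than the population version on the same data, because it divides by a smaller number.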
Slide 12
Lists, Tables, and Plots
Data list: a listing of the values of a variable in a data set.
Slide 13
Table
Slide 14
Table: values in a table are usually ordered or sorted by some
criterion. If not, we can use Excel to sort them.
Slide 15
Plots Dot Plot
Slide 16
Frequency Table
Slide 17
Slide 18
Histogram
Slide 19
Distribution: a description of how the values of the variable are
positioned along an axis or number line.
Symmetric
Skewed to the left (negatively skewed): there is a concentration of
relatively large values, with some scatter over a range of smaller
values.
Skewed to the right (positively skewed): there is a concentration of
relatively small values, with some scatter over a range of larger
values.
Slide 20
Slide 21
Peak: a major concentration of values.
Slide 22
Unimodal distribution: has one major peak.
Bimodal: has two major peaks.
Multimodal: has several major peaks.
Slide 23
Box plot
Slide 24
Box graph
Slide 25
CH4 Scatterplot: a two-dimensional graphical display of two
quantitative variables.
Slide 26
Slide 27
Slide 28
Transformation of a variable: a mathematical manipulation of each
value of the variable.
Logarithmic transformation (a common one)
Square root transformation
Power transformation
Slide 29
Logarithmic transformation: take the logarithm of each value of the
variable.
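As a small illustration, the sketch below applies a base-10 logarithmic transformation to a list of hypothetical values that span several orders of magnitude. The choice of base 10 is an assumption; any base works, since bases differ only by a constant factor.

```python
import math

# Hypothetical values spanning several orders of magnitude.
values = [1, 10, 100, 1000, 10000]

# Logarithmic transformation: replace each value by its base-10 logarithm.
log_values = [math.log10(v) for v in values]

# The raw values are unevenly spread; the transformed values are evenly spaced.
print(log_values)
```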
Slide 30
Further analysis of relationships between variables is in Ch. 15.
Homework
Slide 31
Ch 15 Correlation, Regression
Study the relationship between quantitative variables
Linear correlation coefficient
Slide 32
Mathematical Notation (1) Another form (2)
Slide 33
Formal Definition
Correlation coefficient (Pearson's correlation coefficient): a measure
of linear association between two quantitative variables. r has no
units and takes values from -1 to 1.
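A minimal sketch of computing r, assuming the standard textbook form of Pearson's coefficient: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print("r =", round(r, 4))
```

Points lying exactly on an increasing line give r = 1, and points exactly on a decreasing line give r = -1.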
Slide 34
A correlation coefficient near 0 suggests there is little or no
linear association between those two variables
Slide 35
Example
Slide 36
What exactly does the correlation coefficient measure? It measures the
extent of clustering of plotted points about a straight line. A
correlation coefficient that is large in absolute value suggests
strong linear association between the two variables. A correlation
coefficient that is near zero suggests little linear association
between the two variables.
Slide 37
Can the correlation coefficient be misleading? Yes. We should
always plot two quantitative variables to get a visual feel for
their relationship. Then we can use the correlation coefficient to
supplement the plot.
Slide 38
Slide 39
r is 0.66. By itself, this correlation coefficient might
suggest linear association between these two variables. But the
figure itself suggests a curved relationship. A stronger linear
relationship exists between life expectancy and the logarithm of
per capita gross national product (r = 0.84).
Slide 40
Outlier: an observation that is far from the other observations.
Slide 41
Slide 42
Slide 43
Simple Linear Regression Method of least squares
Slide 44
Slide 45
Example
Slide 46
Scatterplot
Slide 47
Calculation table
Slide 48
Scatterplot with least-squares line
Slide 49
Intercept has no physical meaning here.
Slide 50
Definition of Linear Regression
Simple linear regression refers to fitting a straight-line model by
the method of least squares and then assessing the model.
Applications:
Find the relationship between two quantitative variables.
Predict future values.
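The least-squares fit can be sketched as follows, assuming the standard formulas: the slope b1 is the sum of products of deviations divided by the sum of squared x deviations, and the intercept is b0 = mean(y) - b1 * mean(x). The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)

# Use the fitted line to predict y at a new x value.
prediction = b0 + b1 * 6
print("b0 =", b0, "b1 =", b1, "prediction at x=6:", prediction)
```

Predictions far outside the range of the observed x values should be treated with caution, since the straight-line model is only checked against the data at hand.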
Slide 51
Standard deviation line
Slide 52
What is the relationship between those two lines? Explain why.
Slide 53
Homework
1. Prove that the two forms (1) and (2) are the same.
2. Prove that r is always between -1 and 1.
3. Show the details of calculating b0 and b1.
4. After standardizing a variable, why does its variance become 1?
Slide 54
Project Report Requirements
1. Details of the experiment you did (including environment, time,
process, and the variables you want to measure), or the source of the
data you picked (online or elsewhere). Describe what the numbers stand
for and which two variables you are measuring.
2. Data set (variable name, unit, trials)
3. Scatter plot (including axis labels and the unit of each variable)
4. Linear regression line (y = b0 + b1 x)
5. Detailed calculation of b0 and b1 for your data set
6. Correlation coefficient
7. Summary
(Make it 25 pages)
Slide 55
Ch6 Probability
Examples: a 50% chance of showers; a 1/2 chance of heads and a 1/2
chance of tails; Celtics vs. Lakers, a 60% chance the Celtics will win.
The probability of an event is the chance or likelihood of the event
occurring.
Slide 56
Experiment, Outcome, Sample Space, Events
Experiment: a process leading to a well-defined observation or
outcome. Ex: roll a die, toss a coin.
Sample space: the set of all possible outcomes of the experiment.
Ex: {Head, Tail}; {1, 2, 3, 4, 5, 6}.
Slide 57
Finite sample space: a sample space that contains a finite number of
outcomes. Ex: rolling dice.
Continuous sample space: a sample space that equals an interval of
values. Ex: class length [0, 75], height.
Discrete sample space: a sample space that contains discontinuous
values. Ex: grades A, B, C, D; age; number of Facebook friends.
Slide 58
More Examples
Measure the annual snow amount.
A researcher carries out an experiment as part of a study of cigarette
smoking and lung cancer. She selects a male smoker at random from
among all male smokers, then keeps in touch with him until he either
develops lung cancer or dies with no evidence of lung cancer.
Outcomes:
Discrete sample space: either with cancer or without cancer.
Continuous sample space: assuming he develops cancer, what is the time
until he gets the disease?
Slide 59
Difference between continuous and discrete sample spaces: in a
discrete sample space it is usually easy to calculate the probability
of a given outcome. In a continuous sample space, probabilities may
depend on the distribution function.
Slide 60
Event: a subset of the sample space.
Examples: Roll two dice; the sum of the two numbers is greater than 7.
The class lasts from 49 to 51 minutes. The grade is B or above.
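The two-dice event can be checked by listing the sample space. This sketch assumes two fair six-sided dice, so that all 36 ordered pairs are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered pairs for two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the sum of the two numbers is greater than 7.
event = [pair for pair in sample_space if sum(pair) > 7]

# With equally likely outcomes, P(event) = favorable / total.
p_event = Fraction(len(event), len(sample_space))
print(len(event), "of", len(sample_space), "outcomes, probability", p_event)
```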
Slide 61
Probability Function
Assigns a unique number, or probability, to each outcome in a finite
sample space S.
A probability is always greater than or equal to 0 and less than or
equal to 1.
P(E) denotes the probability of an event E.
P(entire sample space) = ?
P(not E) = 1 - P(E)
Slide 62
Example: toss a coin three times
Slide 63
Slide 64
P(at least two heads) = ?
P(at most two tails) = ?
P(at least one tail) = ?
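One way to answer questions like these is to enumerate the sample space. The sketch below assumes a fair coin, so the eight outcomes of three tosses are equally likely; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three tosses of a fair coin.
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event given as a true/false test on an outcome."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_at_least_two_heads = prob(lambda o: o.count("H") >= 2)
p_at_most_two_tails = prob(lambda o: o.count("T") <= 2)
p_at_least_one_tail = prob(lambda o: o.count("T") >= 1)

print(p_at_least_two_heads, p_at_most_two_tails, p_at_least_one_tail)
```

The last two events can also be found with the complement rule: "at most two tails" is "not three tails", and "at least one tail" is "not three heads".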
Slide 65
Comments: Independent events: if the outcome of A has no effect on the
possible outcome of B, and conversely the outcome of B has no effect
on the possible outcome of A, then we say that A and B are
independent. In that case, P(A and B) = P(A) * P(B).
Slide 66
Slide 67
Homework
Suppose each toss has a 75% probability of heads and a 25% probability
of tails, and the tosses are independent.
P(at least two heads) = ?
P(at most two tails) = ?
P(at least one tail) = ?
Slide 68
Conditional Probability
The conditional probability of event A given event B, denoted P(A|B),
is the probability that events A and B occur together, divided by the
probability of event B:
P(A|B) = P(A and B) / P(B)
Slide 69
Dependent events Ex:
Slide 70
Outcomes and their probabilities (Page 186):
(prisoner black, victim black): P1
(prisoner black, victim white): P2
(prisoner white, victim black): P3
(prisoner white, victim white): P4
Slide 71
Denote
B = {(prisoner white, victim black), (prisoner white, victim white)}
C = {(prisoner black, victim white), (prisoner white, victim white)}
P(B and C) = ?
P(B|C) = ?
Slide 72
Independent Events
Two events are independent if knowing that one of them occurred does
not change the calculated probability that the other occurred. Then
P(A and B) = P(A)*P(B).
This formula can be extended: P(A and B and C) = P(A)*P(B)*P(C) if A,
B, and C are independent.
Slide 73
Example: toss coins
P(three heads in a row) = ?
P(two heads) = ?
P(one tail) = ?
Slide 74
Bayes' Rule
Slide 75
Sensitivity of a diagnostic test is the probability that a
person with the condition under study will test positive.
Specificity of a diagnostic test is the probability that a person
without the condition under study will test negative.
Slide 76
Example 6-11, Page 189
Two different approaches:
1. Bayes' Rule
2. Tree diagram
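A minimal sketch of the Bayes' Rule approach for a diagnostic test, assuming the standard form P(A|T) = P(T|A)P(A) / [P(T|A)P(A) + P(T|not A)P(not A)], where A is "has the condition" and T is "tests positive". The function name and the numbers are hypothetical and are not the values of Example 6-11.

```python
import math

def prob_condition_given_positive(prevalence, sensitivity, specificity):
    """P(condition | positive test) by Bayes' Rule."""
    true_positive = sensitivity * prevalence                # P(T|A) * P(A)
    false_positive = (1 - specificity) * (1 - prevalence)   # P(T|not A) * P(not A)
    return true_positive / (true_positive + false_positive)

# Hypothetical inputs, for illustration only.
posterior = prob_condition_given_positive(
    prevalence=0.01, sensitivity=0.90, specificity=0.95
)
print(round(posterior, 4))
```

The two terms in the denominator correspond to the two branches of a tree diagram that end in a positive test, which is why both approaches give the same answer.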
Slide 77
Random Variable: a rule that assigns a number to each outcome in the
sample space.
Finite random variable: a random variable that takes on a finite
number of values.
Continuous random variable: a random variable that takes on values in
an interval of numbers.
Examples: p192-193
Slide 78
Apply for a job and either be offered it or not.
Sample space = {success, failure}
Assign X(success) = 1, X(failure) = 0
Letter grade: sample space S = {A, B, C, D, F}
Assign a number (GPA) to each letter grade: G(A) = 4, G(B) = 3,
G(C) = 2, G(D) = 1, G(F) = 0
Based on the assignment, we can write probabilities: P(X > 54),
P(120 < Y < 150), P(Z = 2)
Slide 81
Probability Distribution
The probability distribution of a random variable is the collection of
probabilities assigned to events defined by the random variable.
Details in Chapter 8.
Slide 82
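Mean, Variance and Standard Deviation of a Finite Random Variable
The mean (or expected value) of a finite random variable X equals
$E(X) = \sum_x x \, P(X = x)$, denoted $\mu$.
Variance: $\mathrm{Var}(X) = \sum_x (x - \mu)^2 \, P(X = x)$, denoted
Var(X) or $\sigma^2$. The standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$.
Example: P195-196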
Slide 83
Three statistics students volunteer for a taste test comparing Coke
and Pepsi. Each student tastes samples in two identical-looking cups
and decides which beverage he or she prefers. Suppose the students
make selections independently of one another. Suppose also that the
probability of picking Pepsi is 3/5 and Coke 2/5 for all three
students.
Random variable Y: number of Pepsi selections. Possible values of Y:
0, 1, 2, 3.
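The distribution of Y can be built by listing all eight selection patterns and multiplying probabilities, which is valid because the selections are independent. This sketch then applies the usual definitions of the mean (sum of value times probability) and variance (sum of squared deviation times probability); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

p_pepsi = Fraction(3, 5)
p_coke = Fraction(2, 5)

# dist[y] accumulates P(Y = y) over all patterns with y Pepsi selections.
dist = {y: Fraction(0) for y in range(4)}
for picks in product(["Pepsi", "Coke"], repeat=3):
    prob = Fraction(1)
    for pick in picks:
        # Independence: multiply the individual probabilities.
        prob *= p_pepsi if pick == "Pepsi" else p_coke
    dist[picks.count("Pepsi")] += prob

mean = sum(y * p for y, p in dist.items())
variance = sum((y - mean) ** 2 * p for y, p in dist.items())

for y, p in dist.items():
    print("P(Y = %d) = %s" % (y, p))
print("mean =", mean, "variance =", variance)
```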
Homework
P(A) = 0.008, P(T|A) = 0.85, P(not T|not A) = 0.90
Find P(A|T) using two different methods. Verify your answers.
Slide 87
Chapter 7
Permutations, combinations
Binomial distribution
Hypergeometric distribution
Slide 88
Permutation: an ordered arrangement of a finite number of items.
Example: Suppose you are playing FIFA World Cup, and you are in Group
A with three other teams: Brazil, Italy, Spain. How many ways are
there to rank those four teams?
1st: you have 4 choices.
2nd: after you pick one, you have 3 choices.
3rd: now you have only 2 choices.
4th: after you decide the first 3, the one left is your only choice.
In total: 4*3*2*1 = 4! = 24
Slide 89
Generally, if you have n objects to arrange in order, there are n!
ways to do it.
A different permutation example: a lottery with 5 boxes. In each box
there are 10 balls numbered from 0 to 9. To win, the lottery ticket
you bought has to have exactly the same numbers in exactly the same
order as the winning numbers.
Number of ordered outcomes: 10^5
Probability of winning: 1/10^5
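Both counts can be checked with a short sketch. The team label "You" is a placeholder for the player's own team, and the lottery is assumed to draw one ball from each of the 5 boxes independently, with each digit equally likely.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

# Rankings of four teams: list every ordering and compare with 4!.
teams = ["You", "Brazil", "Italy", "Spain"]
rankings = list(permutations(teams))
n_rankings = len(rankings)

# Lottery: 5 boxes, 10 possible digits in each, order matters.
lottery_outcomes = 10 ** 5
p_win = Fraction(1, lottery_outcomes)

print(n_rankings, factorial(4))
print(lottery_outcomes, p_win)
```

The two examples differ in one respect: a team cannot be ranked twice, so the choices shrink (4, 3, 2, 1), while a digit can repeat across boxes, so each box always has 10 choices.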
Slide 90
Combination: a group of objects selected from a larger collection
without regard to order of selection.
Example: the previous FIFA group. Suppose we divide the 4 teams in
Group A into 2 sets. (One set is the teams ranked 1st and 2nd, who go
on to the quarterfinal. The other set is the teams ranked 3rd and 4th,
who go home.) Suppose we only care about which teams are in which set,
and not about the order of the 2 teams within a set. How many possible
ways are there to arrange this? 4!/(2! 2!) = 6
Slide 91
Slide 92
Examples
The Celtics and Lakers meet in the championship final.
In how many ways can the Celtics win 2 of the first 4 games?
In how many ways can the Celtics win the series in exactly 6 games?
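A sketch of both counts, assuming a best-of-seven series (first team to 4 wins). Under that assumption, winning in exactly 6 games means the Celtics win game 6 and exactly 3 of the first 5. The brute-force loop is only a cross-check of the combination formula.

```python
from itertools import product
from math import comb

# Choose which 2 of the first 4 games the Celtics win.
ways_2_of_first_4 = comb(4, 2)

# Win in exactly 6: game 6 is a win, plus 3 wins among games 1 to 5.
ways_win_in_6 = comb(5, 3)

# Cross-check by listing every win/loss sequence of 6 games.
# "W" is a Celtics win, "L" is a Celtics loss.
enumerated = 0
for seq in product("WL", repeat=6):
    if seq[-1] == "W" and seq.count("W") == 4:
        enumerated += 1

print(ways_2_of_first_4, ways_win_in_6, enumerated)
```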
Slide 93
Binomial Distribution
Prerequisite: Bernoulli experiment (Bernoulli trial): an experiment
that has exactly two possible outcomes.
Slide 94
Example
Roll a fair die four times. Win $1 each time the result is divisible
by 3. What is the probability of winning at least $2?
Let the random variable Y equal the number of results that are
divisible by 3 in four rolls. What is the expected gain? If you pay $1
to play, do you expect to come out ahead or not?
Slide 95
Binomial experiment
Consists of n independent repetitions of a Bernoulli experiment (which
has only two possible outcomes).
The probability of success is the same for each repetition.
Random variable X: counts the number of successes in n repetitions.
We say X has a binomial distribution, denoted B(n, p).
n: number of repetitions.
p: probability of success on each repetition.
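A minimal sketch of B(n, p) probabilities, assuming the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). It is applied to a die game with n = 4 rolls where a roll of 3 or 6 counts as a success (p = 1/3) and pays $1, with a $1 fee to play. The function name `binomial_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) when X has a binomial distribution B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 4
p = Fraction(1, 3)   # results 3 and 6 are divisible by 3

# Probability of winning at least $2, i.e. at least 2 successes.
p_at_least_2 = sum(binomial_pmf(k, n, p) for k in range(2, n + 1))

# Expected winnings: sum of (dollars won) * (probability).
expected_winnings = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# Net expected gain after paying $1 to play.
net_gain = expected_winnings - 1

print(p_at_least_2, expected_winnings, net_gain)
```

The expected winnings agree with the shortcut n * p for a binomial random variable.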
Slide 96
Question
Slide 97
Straightforward approach:
Slide 98
What if n is very large? Any other way to do this?
Slide 99
How?
Slide 100
Expected Value, Variance
Slide 101
A few properties of expected value and variance
E(X+Y) = E(X) + E(Y)
E(cX) = cE(X)
E(X-Y) = E(X) - E(Y)
E(X+c) = E(X) + c
Var(X+Y) = Var(X) + Var(Y) (X and Y independent)
Var(X-Y) = Var(X) + Var(Y) (X and Y independent)
Var(cX) = c^2 Var(X)
Slide 102
Example
Suppose I am on a vacation (that would be awesome!), and I plan to
visit 12 countries. The probability that I like a country is 2/3 (for
all 12 of them). What is the probability that I like more than 9 of
them?
This time the straightforward approach (listing all possible outcomes)
is practically impossible. Use the binomial distribution.
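A sketch of the binomial calculation for this example, assuming the 12 countries are liked or not independently of one another, so that the count has a B(12, 2/3) distribution. "More than 9" means 10, 11, or 12.

```python
from fractions import Fraction
from math import comb

n = 12
p = Fraction(2, 3)

# P(X > 9) = P(X = 10) + P(X = 11) + P(X = 12)
p_more_than_9 = sum(
    comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(10, n + 1)
)

print(p_more_than_9, "is about", round(float(p_more_than_9), 4))
```

Listing outcomes directly would mean going through 2^12 = 4096 like/dislike patterns, which is why the formula is the practical route.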
Slide 103
Hypergeometric Distribution Suppose RV X counts the number of
Type 1 objects in a sample selected at random from a finite
collection of objects, each classified as either Type 1 or Type 2.
Then we say X has a hypergeometric probability distribution, and we
call X a hypergeometric random variable.
Slide 104
General Problem
Suppose that in a group of N objects, m1 are type 1 and m2 are type 2.
Select a sample of n at random from the N objects. The random variable
X counts the number of type 1 objects in the sample. What is the
probability that k of the n selected objects are type 1? P(X = k)
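A minimal sketch of P(X = k), assuming the sample is drawn without replacement and using the standard hypergeometric formula C(m1, k) C(m2, n - k) / C(N, n) with N = m1 + m2. The function name and the numbers are hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(k, m1, m2, n):
    """P(X = k): k type 1 objects in a sample of n drawn without replacement."""
    total = m1 + m2
    # Ways to pick k of the type 1 objects and n - k of the type 2 objects,
    # divided by the number of ways to pick any n objects.
    return Fraction(comb(m1, k) * comb(m2, n - k), comb(total, n))

# Hypothetical group: 4 type 1 objects, 6 type 2 objects, sample of 3.
p_two_type1 = hypergeometric_pmf(2, m1=4, m2=6, n=3)
print(p_two_type1)
```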
Slide 105
Slide 106
Slide 107
Slide 108
Ch8 Gaussian (Normal) Distribution
Gaussian (normal) distribution
Standard normal distribution
Central Limit Theorem
Slide 109
Slide 110
Normal probability function curve vs. standard normal probability
function curve
The standard normal distribution has a mean of 0 and a standard
deviation of 1.
Slide 111
Area under the curve is 1! Now the probability of random
variable Z can be represented by the area!
Slide 112
Slide 113
Cumulative probability: has the form P(X ≤ c), where X is a random
variable and c is a constant.
Tail probability: a probability that is small (less than 0.5) and has
the form P(X ≤ c) or P(X ≥ c) for some number c.
Slide 114
Do some problems: P(Z ≤ 1), P(Z ≤ 2), P(-1 ≤ Z ≤ 2)
Slide 115
Standardize
Slide 116
Suppose we have a random variable X with mean 3 and variance 4.
P(1 ≤ X ≤ 5) = ?
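A sketch of the standardization step, under the assumption that X is Gaussian (the mean and variance alone do not determine the answer). It converts each endpoint to Z = (X - mean) / standard deviation and uses the standard normal cumulative probability, computed here from the error function in Python's `math` module. The helper name `standard_normal_cdf` is hypothetical.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean = 3
variance = 4
sd = math.sqrt(variance)   # standardizing uses the standard deviation, not the variance

# Standardize the endpoints of the interval 1 <= X <= 5.
z_low = (1 - mean) / sd
z_high = (5 - mean) / sd

# Area under the standard normal curve between the two z values.
p_interval = standard_normal_cdf(z_high) - standard_normal_cdf(z_low)
print(z_low, z_high, round(p_interval, 4))
```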
Slide 117
Approximating Normal Distribution A distribution of data values
is approximately Gaussian if the proportion of values in any
interval approximately equals the area over that interval under the
appropriate Gaussian curve. The distribution of a random variable
is approximately Gaussian if the probability that the random
variable is in any interval approximately equals the area over that
interval under a Gaussian curve.
Slide 118
Central Limit Theorem Background
Random sample: a collection of independent random variables with the
same probability distribution.
Slide 119
Slide 120
Central Limit Theorem
Slide 121
How large does the sample size have to be? It depends on how different
the probability distribution of the X_i is from a Gaussian
distribution.
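A small simulation can give a feel for the theorem. This sketch assumes the usual statement: for a random sample of size n from a distribution with mean mu and standard deviation sigma, the sample mean is approximately Gaussian with mean mu and standard deviation sigma / sqrt(n). The skewed exponential distribution, the sample size 30, and the number of trials are all hypothetical choices.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

n = 30           # size of each random sample (hypothetical choice)
trials = 2000    # number of samples drawn

# Each X_i is exponential with mean 1 and standard deviation 1,
# a strongly right-skewed distribution.
sample_means = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1 / n ** 0.5   # sigma / sqrt(n) with sigma = 1

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:", round(sd_of_means, 3), "vs", round(expected_sd, 3))
```

A histogram of `sample_means` would look far more symmetric than a histogram of the individual exponential values; repeating the run with a smaller n shows more leftover skew.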
Slide 122
Slide 123
Large-sample result related to the CLT
Slide 124
Chapter 9
Basic ideas in statistics
Hypothesis testing
Slide 125
Definitions
Statistical inference: the process of drawing conclusions about a
population based on a sample from that population.
Population: hypothetical, or has substance.
Sample: random sample in the probability sense, or random sample in
the experimental sense.
Slide 126
Statistical inference: parametric or nonparametric.
Parameter: mean, median, standard deviation, variance.
Types of estimates used to estimate a population parameter: point
estimate, interval estimate.
Slide 127
Interval Estimate
Confidence interval: an interval estimate of a parameter, with a
probability interpretation.
Hypothesis testing: a formal strategy for comparing two statements
about the state of nature in an experimental situation.
Null hypothesis
Alternative hypothesis
Slide 128
Null hypothesis: a statement about the state of nature in an
experimental situation. Generally a "no difference" statement.
Denoted H0.
Alternative hypothesis: a statement about the state of nature,
providing an alternative to that specified in the null hypothesis.
Denoted Ha or H1.
Slide 129
Example 1
State a specific goal. Write this goal in terms of hypotheses to be
compared.
Design an experiment to meet this goal.
State assumptions and describe how the experiment will be analyzed.
(Table 9.1)
Carry out the experiment.
Analyze and interpret the results.
Slide 130
Example 2
Slide 131
Hypothesis Testing General Strategy
Significance level approach
Test statistic: a measure of how much the sample observations differ
from what we would expect if the null hypothesis were true.
Significance level (errors: Type 1, Type 2; power of test): the
probability of saying the observations are inconsistent with the null
hypothesis when the null hypothesis is really true.
p-value approach
p-value: the probability of seeing a test statistic as extreme as or
more extreme (in the direction of the alternative) than the one
observed, if the null hypothesis were really true.
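To make the two approaches concrete, here is a sketch of a large-sample z test for a population mean with a one-sided "greater than" alternative. The choice of test, the helper name, and every number below (null mean 50, sample mean 52, sample standard deviation 10, sample size 100, significance level 0.05) are hypothetical and serve only as an illustration.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical setting: H0 says the mean is 50, Ha says the mean is greater than 50.
mu0 = 50
sample_mean = 52
sample_sd = 10
n = 100
alpha = 0.05   # significance level (Type 1 error probability)

# Test statistic: how far the sample mean is from mu0, in standard errors.
z = (sample_mean - mu0) / (sample_sd / math.sqrt(n))

# p-value: probability of a z at least this large if H0 were true.
p_value = 1 - standard_normal_cdf(z)

# Significance level approach: reject H0 when the p-value is below alpha.
reject_h0 = p_value < alpha

print("z =", z, "p-value =", round(p_value, 4), "reject H0:", reject_h0)
```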
Slide 132
Comments on Hypothesis Testing
How do we decide what the hypotheses should be?
What assumptions do we make about the sample?
What if the assumptions we make do not apply to the actual sampling
process?
How do we select a test statistic?
How do we define an acceptance region and a rejection region when
using the significance level approach?
Why do we consider values of the test statistic more extreme than the
one actually observed when calculating a p-value?
When is a p-value small and when is it large?
Should we use the significance level or the p-value approach to
hypothesis testing?
How do we evaluate the power of a test?
Slide 133
Comments on Experimental Design
Experimental design: the area of statistics concerned with designing
an investigation to best meet the study goals, as well as the
assumptions for statistical inference. See page 300 of the book.
Slide 134
Slide 135
Slide 136
Slide 137
Slide 138
Slide 139
Slide 140
Slide 141
Slide 142
Slide 143
Chapter 10
Large-sample inference about a population mean
Large-sample inference about a proportion
t-test and confidence interval based on a t distribution (small
sample)
Slide 144
Example 10.1 (large-sample population mean)
Example 10.2 (large-sample proportion)
Example 10.3 (t distribution)