Xiaobo Sheng. Overview CH 1 Introduction CH 2-3 Concepts, Descriptive Statistics of one variable CH 6-8 Probability, A few common probability distributions

Xiaobo Sheng

Overview CH 1 Introduction CH 2-3 Concepts, Descriptive Statistics of one variable CH 6-8 Probability, A few common probability distributions and models CH 9-13 Statistical Inference CH 15 Linear Regression

Introduction. What is statistics? A collection of numerical information, or the branch of mathematics dealing with the theory and techniques of collecting, organizing, and interpreting numerical information. (We will focus on the first definition.)

Why do we need statistics? Pepsi vs. Coke, horse racing, casino games.

How do we deal with statistics? Input: data set (a collection of information). Process: data analysis (making sense of a data set). Output: statistical inference (drawing conclusions about a population based on a sample from that population). Example data set: Barney A-, Ted B+, Lily A, Robin C, Marshall F.

A few basic definitions to know. Population: the group or collection of interest to us; usually it is very large and messy. Sample: a subset of the population, reasonably small and capable of being analyzed using statistical tools. We use the observations in the sample to learn about the population. Examples: income of teachers, average age, etc.

Descriptive statistic: a number used to summarize information in a set of data values; the appropriate one varies from problem to problem. Variable: a particular piece of information. Two types: a quantitative variable has numerical values that are measurements; a categorical variable has values that cannot be interpreted as numbers.

1st quartile (25th percentile): at least three-fourths of the values are greater than or equal to the first quartile. 3rd quartile (75th percentile): at least three-fourths of the values are less than or equal to the third quartile. Page 49

Range Difference between the largest and smallest values of a data set. Interquartile range Difference between the 3 rd and 1 st quartiles

Standard deviation: used to measure the variation of values about the mean. σ: population standard deviation. s: sample standard deviation. P82
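A minimal sketch of the summary measures above (range, quartiles, interquartile range, standard deviation), using Python's standard library. The data values are made up for illustration, and the quartile convention used by `statistics.quantiles` may differ slightly from the textbook's.

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# statistics.quantiles uses the "exclusive" method by default;
# a textbook may use a slightly different quartile convention.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range

sigma = statistics.pstdev(data)  # population standard deviation (divide by n)
s = statistics.stdev(data)       # sample standard deviation (divide by n - 1)

print(data_range, q1, q3, iqr, round(sigma, 3), round(s, 3))
```

The sample standard deviation is slightly larger than the population one because it divides by n - 1 instead of n.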

Lists, Tables, and Plots Data list A listing of the values of a variable in a data set.

Table: usually the values in a table are ordered or sorted by some criterion. If not, we can use Excel to sort them.

Plots Dot Plot

Frequency Table

Histogram

Distribution: a description of how the values of the variable are positioned along an axis or number line. Symmetric. Skewed to the left (negatively skewed): there is a concentration of relatively large values, with some scatter over a range of smaller values. Skewed to the right (positively skewed): there is a concentration of relatively small values, with some scatter over a range of larger values.

Peak A major concentration of values.

Unimodal distribution has one major peak Bimodal has two major peaks Multimodal has several major peaks

Box plot

Box graph

CH4 Scatterplot two-dimensional graphical display of two quantitative variables.

Transformation of a variable a mathematical manipulation of each value of the variable. logarithmic transformation(common one) square root transformation power transformation

Logarithmic transformation take the logarithm of each value of the variable.

Further analysis of relationships between variables is in Ch. 15. Homework

Ch 15 Correlation, Regression Study relationship between quantitative variables Linear Correlation Coefficient

Mathematical Notation (1) Another form (2)

Formal Definition: correlation coefficient (Pearson's correlation coefficient). A measure of linear association between two quantitative variables. r has no unit and takes values from -1 to 1.
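The formulas on the notation slide did not survive in this transcript. As a hedged reference, two standard and equivalent ways of writing Pearson's r are shown below. They may not be exactly the forms labeled (1) and (2) on the original slide. Here x̄ and ȳ are the sample means, s_x and s_y the sample standard deviations, and n the number of pairs.

```latex
r = \frac{1}{n-1}\sum_{i=1}^{n}
      \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\;\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```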

A correlation coefficient near 0 suggests there is little or no linear association between those two variables

Example

What exactly does the correlation coefficient measure? It measures the extent of clustering of plotted points about a straight line. A correlation coefficient that is large in absolute value suggests strong linear association between the two variables. A correlation coefficient that is near zero suggests little linear association between the two variables.

Can the correlation coefficient be misleading? Yes. We should always plot two quantitative variables to get a visual feel for their relationship. Then we can use the correlation coefficient to supplement the plot.

Here r is 0.66. By itself, this correlation coefficient might suggest linear association between these two variables, but the figure itself suggests a curved relationship. A stronger linear relationship exists between life expectancy and the logarithm of per capita gross national product (r = 0.84).
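To illustrate how a logarithmic transformation can straighten a curved relationship, here is a small sketch. The numbers are invented (they are not the life expectancy data from the figure), and `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values: y rises by a fixed amount each time x doubles,
# so y is linear in log(x) but curved in x.
gnp = [200, 400, 800, 1600, 3200, 6400, 12800, 25600]
life = [45, 49, 53, 57, 61, 65, 69, 73]

r_raw = pearson_r(gnp, life)
r_log = pearson_r([math.log(g) for g in gnp], life)
print(round(r_raw, 2), round(r_log, 2))
```

In this constructed example the correlation with the transformed variable is essentially 1, while the correlation with the raw variable is noticeably smaller.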

Outlier An observation that is far from the other observations.

Simple Linear Regression Method of least squares

Example

Scatterplot

Calculation table

Scatterplot with least-squares line

Intercept has no physical meaning here.

Definition of Linear Regression: simple linear regression refers to fitting a straight-line model by the method of least squares and then assessing the model. Applications: finding the relationship between two quantitative variables; predicting future values.
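A minimal sketch of the method of least squares for the line y = b0 + b1 x, using the standard formulas b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1 x̄. The data and the function name `least_squares_line` are assumptions made for this illustration, not the example from the slides.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
predicted = b0 + b1 * 6  # prediction at a new x value
print(round(b0, 2), round(b1, 2), round(predicted, 2))
```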

Standard deviation line

What is the relationship between those two lines? Explain why.

Homework
1. Prove that the two forms (1) and (2) are the same.
2. Prove that r always lies between -1 and 1.
3. Show the details of calculating b0 and b1.
4. After standardizing a variable, why does its variance become 1?

Project Report Requirements
1. Details of the experiment you did (including environment, time, process, and the variables you want to measure), or the source of the data you picked (either online or somewhere else). Describe what the numbers stand for and which two variables you are measuring.
2. Data set (variable name, unit, trials).
3. Scatter plot (including axis labels and the unit of each variable).
4. Linear regression line (y = b0 + b1 x).
5. Detailed calculation of b0 and b1 for your data set.
6. Correlation coefficient.
7. Summary.
(Make it 25 pages)

Ch6 Probability. Examples: a 50% chance of showers; a 1/2 chance of heads and a 1/2 chance of tails; Celtics vs. Lakers, a 60% chance that the Celtics will win. The probability of an event is the chance or likelihood of the event occurring.

Experiment, Outcome, Sample Space, Events Experiment A process leading to a well-defined observation or outcome. Ex: Roll a die, toss a coin Sample space A set of all possible outcomes of the experiment Ex: Head/Tail 1,2,3,4,5,6

Finite sample space: a sample space that contains a finite number of outcomes (Ex: rolling dice). Continuous sample space: a sample space that equals an interval of values. Ex: class length [0, 75], height. Discrete sample space: a sample space that contains discontinuous values. Ex: grades A, B, C, D; age; number of Facebook friends.

More Examples. Measure the annual snowfall. A researcher carries out an experiment as part of a study of cigarette smoking and lung cancer. She selects a male smoker at random from among all male smokers, then keeps in touch with him until he either develops lung cancer or dies with no evidence of lung cancer. Outcomes: discrete sample space: either with cancer or without cancer. Continuous sample space: assuming he develops cancer, at what time does he get the disease?

Difference between continuous and discrete sample spaces: in a discrete sample space it is usually easy to calculate the probability of a given outcome. In a continuous sample space the probability may depend on the distribution function.

Event: a subset of the sample space. Examples: roll two dice, and the sum of the two numbers is greater than 7. The class lasts from 49 to 51 minutes. The grade is B or above.

Probability Function Assigns a unique number or probability to each outcome in a finite sample space S. probability is always greater than or equal to 0 and less than or equal to 1. P(E) denotes the probability of an event E. P(entire sample space) = ? P(not E) = 1 - P(E)

Examples Toss a coin three times

P(at least two heads) = ? P(at most two tails) = ? P( at least one tail) = ?
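One way to check answers to questions like these is to list every outcome and count. The sketch below assumes a fair coin, so that all 8 sequences of three tosses are equally likely. The helper `prob` is written for this example.

```python
from fractions import Fraction
from itertools import product

# All sequences of three tosses: ('H','H','H'), ('H','H','T'), ...
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_at_least_two_heads = prob(lambda o: o.count("H") >= 2)
p_at_most_two_tails = prob(lambda o: o.count("T") <= 2)
p_at_least_one_tail = prob(lambda o: o.count("T") >= 1)

print(p_at_least_two_heads, p_at_most_two_tails, p_at_least_one_tail)
```

The last two can also be found with the complement rule: P(at most two tails) = 1 - P(three tails), and P(at least one tail) = 1 - P(three heads).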

Comments: independent events. If the outcome of A has no effect on the possible outcome of B, and the outcome of B has no effect on the possible outcome of A, then we say that A and B are independent. In that case P(A and B) = P(A) * P(B).

Homework: if the probability of getting a head is 75% and of getting a tail is 25%, and the tosses are independent, then P(at least two heads) = ? P(at most two tails) = ? P(at least one tail) = ?

Conditional Probability: the conditional probability of event A given event B, denoted P(A|B), is the probability that events A and B occur together, divided by the probability of event B: P(A|B) = P(A and B) / P(B).
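A small counting sketch of conditional probability, using two fair dice (an assumption for this illustration). Event A is "the sum is greater than 7" and event B is "the first die is even". Both events are chosen here only as an example.

```python
from fractions import Fraction
from itertools import product

# 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def in_a(o):
    return o[0] + o[1] > 7   # A: sum greater than 7

def in_b(o):
    return o[0] % 2 == 0     # B: first die is even

p_a = prob(in_a)
p_b = prob(in_b)
p_a_and_b = prob(lambda o: in_a(o) and in_b(o))
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)

print(p_a, p_b, p_a_and_b, p_a_given_b)
```

Because P(A|B) differs from P(A) here, knowing that B occurred changes the probability of A, so these two events are dependent.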

Dependent events Ex:

Outcomes: (prisoner black, victim black) with probability P1; (prisoner black, victim white) with probability P2; (prisoner white, victim black) with probability P3; (prisoner white, victim white) with probability P4. Page 186

Denote B = {(prisoner white, victim black), (prisoner white, victim white)} C = {(prisoner black, victim white), (prisoner white, victim white)} P(B and C) =? P(B|C) = ?

Independent Events: two events are independent if knowing that one of them occurred does not change the calculated probability that the other occurred. For independent A and B, P(A and B) = P(A)*P(B). This formula can be extended: P(A and B and C) = P(A)*P(B)*P(C) if A, B, and C are independent.

Example: Toss coins P(three heads in a row) = ? P(two heads) = ? P( one tail) = ?

Bayes Rule

Sensitivity of a diagnostic test is the probability that a person with the condition under study will test positive. Specificity of a diagnostic test is the probability that a person without the condition under study will test negative.
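A hedged sketch of how sensitivity and specificity combine with Bayes' rule, P(A|T) = P(T|A)P(A) / [P(T|A)P(A) + P(T|not A)P(not A)], where A is "has the condition" and T is "tests positive". The prevalence, sensitivity, and specificity values below are invented for illustration, and `posterior_given_positive` is a name made up for this example.

```python
from fractions import Fraction

def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(condition | positive test) by Bayes' rule."""
    true_pos = sensitivity * prevalence               # P(T and A)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(T and not A)
    return true_pos / (true_pos + false_pos)

# Hypothetical test characteristics.
prevalence = Fraction(1, 100)    # P(A)
sensitivity = Fraction(9, 10)    # P(T | A)
specificity = Fraction(95, 100)  # P(not T | not A)

p_condition_given_positive = posterior_given_positive(
    prevalence, sensitivity, specificity
)
print(p_condition_given_positive, float(p_condition_given_positive))
```

The two terms in the denominator correspond to the two branches of a tree diagram that end in a positive test, which is why the tree approach and the formula agree.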

Example 6-11, Page 189. Two different approaches: 1. Bayes' Rule 2. Tree diagram

Random Variable A rule that assigns a number to each outcome in the sample space. Finite random variable is a random variable that takes on a finite number of values. Continuous random variable is a random variable that takes on values in an interval of numbers. Examples p192-193

Apply for a job and either be offered it or not. Sample space = {success, failure}. Assign X(success) = 1, X(failure) = 0. Letter grade: sample space S = {A, B, C, D, F}. Assign a number (GPA) to each letter grade: G(A) = 4, G(B) = 3, G(C) = 2, G(D) = 1, G(F) = 0.

Based on the assignment, we can write probabilities: P(X > 54), P(120 ≤ Y ≤ 150), P(Z = 2)

Probability Distribution: the probability distribution of a random variable is the collection of probabilities assigned to events defined by the random variable. Details in Chapter 8.

Mean, Variance and Standard Deviation of a Finite Random Variable. The mean (or expected value) of a finite random variable X equals the sum, over its possible values x, of x * P(X = x). It is denoted E(X) or μ. The variance is denoted Var(X) or σ². Example: P195-196

Three statistics students volunteer for a taste test comparing Coke and Pepsi. Each student tastes samples in two identical-looking cups and decides which beverage he or she prefers. Suppose the students make selections independently of one another. Suppose also that the probability of picking Pepsi is 3/5 and Coke 2/5 for all three students. Random variable Y: number of Pepsi selections. Possible values of Y: 0, 1, 2, 3

Questions: Expected value of Y? E(Y) = 3*P(Y=3) + 2*P(Y=2) + 1*P(Y=1) + 0*P(Y=0). P(Y=3) = 27/125, P(Y=2) = 54/125, P(Y=1) = 36/125, P(Y=0) = 8/125. E(Y) = 225/125 = 1.8

What is the variance of Y?
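The taste-test calculation can be sketched in a few lines. The code builds the probability of each value of Y from the stated assumptions (three independent students, each picking Pepsi with probability 3/5) and then applies the definitions E(Y) = Σ y P(Y=y) and Var(Y) = Σ (y - E(Y))² P(Y=y). Variable names are chosen for this example.

```python
from fractions import Fraction
from math import comb

n = 3                   # number of students
p = Fraction(3, 5)      # probability that a student picks Pepsi

# P(Y = k): choose which k students pick Pepsi, times the probability of that pattern.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

mean = sum(k * pk for k, pk in pmf.items())
variance = sum((k - mean) ** 2 * pk for k, pk in pmf.items())

print(pmf[3], pmf[2], pmf[1], pmf[0])
print(mean, variance)
```

Using exact fractions avoids rounding, so the results can be compared directly with hand calculations.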

Homework P(A) = 0.008 P(T|A) = 0.85 P(not T|not A) = 0.90 Find P(A|T) using two different methods. Verify your answers.

Chapter 7 Permutations, combinations Binomial distribution Hypergeometric distribution

Permutation: an ordered arrangement of a finite number of items. Example: suppose you are playing FIFA World Cup and you are in Group A with three other teams: Brazil, Italy, Spain. In how many ways can those four teams be ranked? 1st: you have 4 choices. 2nd: after you pick one, you have 3 choices. 3rd: now you have only 2 choices. 4th: after you decide the first 3, the one left is your only choice. In total: 4*3*2*1 = 4! = 24

Generally, if you have n objects to arrange in order, there are n! ways to do it. A different permutation example: a lottery with 5 boxes, each containing 10 balls numbered from 0 to 9. To win, the lottery ticket you bought has to have exactly the same numbers in exactly the same order as the winning numbers. Permutations: 10^5. Probability of winning: 1/10^5
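Both counts can be checked by brute force. The sketch below lists every ranking of the four teams and counts the possible five-digit lottery draws. The team label "You" stands in for the reader's team.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

teams = ["You", "Brazil", "Italy", "Spain"]

# Every possible ranking (ordered arrangement) of the four teams.
rankings = list(permutations(teams))
num_rankings = len(rankings)

# Lottery: 5 boxes, each with 10 possible digits, order matters.
num_draws = 10 ** 5
p_win = Fraction(1, num_draws)

print(num_rankings, factorial(4), num_draws, p_win)
```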

Combination: a group of objects selected from a larger collection without regard to order of selection. Example: the previous FIFA example. Suppose we divide the 4 teams in Group A into 2 subgroups. (One subgroup is the teams ranked 1st and 2nd, who go to the quarterfinal. The other is the teams ranked 3rd and 4th, who have to go home.) Suppose we only care about which teams are in which subgroup, and not about the order of the 2 teams within a subgroup. In how many ways can this be arranged? 4!/(2!2!) = 6

Examples: the Celtics and Lakers meet in the championship final. In how many ways can the Celtics win 2 of the first 4 games? In how many ways can the Celtics win the series in exactly 6 games?
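A sketch of how these counts could be worked out, assuming a best-of-seven series (first team to 4 wins) and reading the first question as "exactly 2 of the first 4". To win in exactly 6 games, the Celtics must win game 6 and exactly 3 of the first 5. The enumeration at the end is only a cross-check.

```python
from itertools import product
from math import comb

# Exactly 2 Celtics wins among the first 4 games: choose which 2 games.
ways_two_of_first_four = comb(4, 2)

# Series won in exactly 6 games: game 6 is a win, plus 3 wins among games 1-5.
ways_win_in_six = comb(5, 3)

# Cross-check by listing all 6-game sequences ('W' = Celtics win, 'L' = loss).
sequences = [
    s for s in product("WL", repeat=6)
    if s[-1] == "W" and s.count("W") == 4
]

print(ways_two_of_first_four, ways_win_in_six, len(sequences))
```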

Binomial Distribution Pre-requisite Bernoulli experiment(Bernoulli trial) An experiment that has exactly two possible outcomes.

Example: roll a fair die four times. Win $1 each time the result is divisible by 3. What is the probability of winning at least $2? Let the random variable Y equal the number of results that are divisible by 3 in four rolls. What is the expected gain? If you pay $1 to play, do you expect to come out ahead?

Binomial experiment: consists of n independent repetitions of a Bernoulli experiment (which has only two possible outcomes). The probability of success is the same for each repetition. Random variable X: counts the number of successes in n repetitions. We say X has a binomial distribution, denoted B(n, p). n: number of repetitions. p: probability of success on each repetition.
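As an illustration, the standard binomial probability P(X = k) = C(n, k) p^k (1 - p)^(n - k) can be applied to the fair-die example, where a "success" is a roll divisible by 3 (a 3 or a 6, so p = 2/6). The function name `binomial_pmf` is chosen for this sketch.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 4
p = Fraction(2, 6)  # rolls of 3 or 6 are divisible by 3

# Probability of winning at least $2, i.e. at least 2 successes.
p_at_least_two = sum(binomial_pmf(k, n, p) for k in range(2, n + 1))

# Expected winnings ($1 per success) and net gain after paying $1 to play.
expected_winnings = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
net_gain = expected_winnings - 1

print(p_at_least_two, expected_winnings, net_gain)
```

The expected number of successes computed this way matches the shortcut n * p.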

Question

Straight Forward approach:

What if n is very large? Any other way to do this?

Expected Value, Variance

A few properties of expected values and variances: E(X+Y) = E(X) + E(Y); E(cX) = cE(X); E(X-Y) = E(X) - E(Y); E(X+c) = E(X) + c. If X and Y are independent: Var(X+Y) = Var(X) + Var(Y) and Var(X-Y) = Var(X) + Var(Y). Var(cX) = c^2 Var(X)

Example: suppose I am on a vacation (that would be awesome!) and I plan to visit 12 countries. The probability that I like a country is 2/3 (for all 12 of them). What is the probability that I like more than 9 of them? This time the straightforward approach (listing all possible outcomes) is practically impossible. Use the binomial distribution.
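A sketch of the binomial calculation for this example, treating the 12 countries as independent trials with success probability 2/3 (independence is an assumption the binomial model needs). "More than 9" means 10, 11, or 12.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 12
p = Fraction(2, 3)

# P(X > 9) = P(X = 10) + P(X = 11) + P(X = 12)
p_more_than_nine = sum(binomial_pmf(k, n, p) for k in range(10, n + 1))

print(p_more_than_nine, round(float(p_more_than_nine), 4))
```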

Hypergeometric Distribution Suppose RV X counts the number of Type 1 objects in a sample selected at random from a finite collection of objects, each classified as either Type 1 or Type 2. Then we say X has a hypergeometric probability distribution, and we call X a hypergeometric random variable.

General Problem: suppose that in a group of N objects, m1 are Type 1 and m2 are Type 2. Select a sample of n at random from the N objects. The random variable X counts the number of Type 1 objects in the sample. What is the probability that, among the n objects selected, k are Type 1? P(X = k)
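The standard counting answer is P(X = k) = C(m1, k) C(m2, n - k) / C(N, n), with N = m1 + m2: choose k of the Type 1 objects and n - k of the Type 2 objects, out of all ways to choose n objects. A minimal sketch, with made-up numbers and a function name chosen for this example:

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(k, m1, m2, n):
    """P(X = k): k Type 1 objects in a sample of n drawn without replacement."""
    return Fraction(comb(m1, k) * comb(m2, n - k), comb(m1 + m2, n))

# Hypothetical group: 4 Type 1 objects, 6 Type 2 objects, sample of 3.
m1, m2, n = 4, 6, 3

p_two_type1 = hypergeometric_pmf(2, m1, m2, n)
total = sum(hypergeometric_pmf(k, m1, m2, n) for k in range(n + 1))

print(p_two_type1, total)
```

Unlike the binomial case, the draws here are not independent, because sampling is without replacement.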

Ch8 Gaussian (Normal) Distribution: Gaussian (Normal) Distribution, Standard Normal Distribution, Central Limit Theorem

Normal probability function curve vs Standard normal probability function curve Standard normal distribution has mean of 0 and standard deviation of 1.

Area under the curve is 1! Now the probability of random variable Z can be represented by the area!

Cumulative probability: has the form P(X ≤ c), where X is a random variable and c is a constant. Tail probability: a probability that is small (less than 0.5) and has the form P(X ≤ c) or P(X ≥ c) for some number c.

Do some problems: P(Z ≤ 1), P(Z ≤ 2), P(-1 ≤ Z ≤ 2)

Standardize

Suppose we have a normal random variable X with mean 3 and variance 4. P(1 ≤ X ≤ 5) = ?
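A sketch of how standardizing works in practice, assuming X is Gaussian with mean 3 and variance 4 (so standard deviation 2). It uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later) in place of a printed normal table.

```python
from statistics import NormalDist

mu = 3.0
sigma = 4.0 ** 0.5       # standard deviation is the square root of the variance

# Standardize the endpoints: z = (x - mu) / sigma.
z_low = (1 - mu) / sigma
z_high = (5 - mu) / sigma

standard_normal = NormalDist(mu=0.0, sigma=1.0)
p_via_z = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# The same probability computed directly from the distribution of X.
x_dist = NormalDist(mu=mu, sigma=sigma)
p_direct = x_dist.cdf(5) - x_dist.cdf(1)

print(z_low, z_high, round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same area under the curve, which is the point of standardizing: any Gaussian probability reduces to a standard normal one.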

Approximating Normal Distribution A distribution of data values is approximately Gaussian if the proportion of values in any interval approximately equals the area over that interval under the appropriate Gaussian curve. The distribution of a random variable is approximately Gaussian if the probability that the random variable is in any interval approximately equals the area over that interval under a Gaussian curve.

Central Limit Theorem Background Random Sample A collection of independent random variables with the same probability distribution.

Central Limit Theorem

How large does the sample size have to be? It depends on how different the probability distribution of the X_i's is from a Gaussian distribution.
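A small simulation sketch of the idea behind the Central Limit Theorem: means of random samples from a non-Gaussian distribution cluster around the population mean with spread close to σ/√n. The uniform distribution on [0, 1], the sample size 30, and the 2000 repetitions are arbitrary choices for this illustration.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

n = 30           # size of each random sample
reps = 2000      # number of samples drawn

# Each entry is the mean of one random sample of n uniform(0, 1) values.
sample_means = [
    statistics.mean([random.random() for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Uniform(0, 1) has mean 1/2 and variance 1/12.
theoretical_sd = (1 / 12) ** 0.5 / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 4), round(theoretical_sd, 4))
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual values come from a flat distribution.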

Large-sample result related to the CLT

Chapter 9 Basic ideas in Statistics Hypothesis testing

Definitions Statistical inference The process of drawing conclusions about a population based on a sample from that population Population: Hypothetical Has substance Sample: Random Sample in probability sense Random Sample in experimental sense

Statistical inference: parametric or nonparametric. Parameters: mean, median, standard deviation, variance. Estimate types, used to estimate a population parameter: point estimate, interval estimate.

Interval Estimate Confidence interval an interval estimate of a parameter, with a probability interpretation Hypothesis testing A formal strategy for comparing two statements about the state of nature in an experimental situation. Null Hypothesis Alternative hypothesis

Null hypothesis: a statement about the state of nature in an experimental situation, generally a "no difference" statement. Denoted H0. Alternative hypothesis: a statement about the state of nature, providing an alternative to that specified in the null hypothesis. Denoted Ha or H1.

Example 1. State a specific goal. Write this goal in terms of hypotheses to be compared. Design an experiment to meet this goal. State assumptions and describe how the experiment will be analyzed (Table 9.1). Carry out the experiment. Analyze and interpret the results.

Example 2

Hypothesis Testing General Strategy. Significance level approach. Test statistic: a measure of how much the sample observations differ from what we would expect if the null hypothesis were true. Significance level (errors: Type 1, Type 2; power of test): the probability of saying the observations are inconsistent with the null hypothesis when the null hypothesis is really true. p-value approach. p-value: the probability of seeing a test statistic as extreme as or more extreme (in the direction of the alternative) than the one observed, if the null hypothesis were really true.
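A hedged sketch of the p-value approach for one common setting, a large-sample test about a population mean with a one-sided alternative. All numbers are hypothetical, and the z statistic z = (x̄ - μ0) / (s / √n) is one standard choice of test statistic, not necessarily the one used in the book's examples.

```python
from statistics import NormalDist

# Hypothetical setup: H0: mean = 100 versus Ha: mean > 100.
mu_0 = 100.0          # mean under the null hypothesis
sample_mean = 103.0   # observed sample mean
sample_sd = 15.0      # observed sample standard deviation
n = 100               # sample size (large, so the z approximation is reasonable)
alpha = 0.05          # significance level, chosen before looking at the data

# Test statistic: how far the sample mean is from mu_0, in standard errors.
z = (sample_mean - mu_0) / (sample_sd / n ** 0.5)

# p-value: probability of a z at least this large if H0 were true.
p_value = 1 - NormalDist().cdf(z)

reject_null = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_null)
```

The last line links the two approaches: rejecting when the p-value is below α gives the same decision as checking whether z falls in the rejection region for significance level α.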

Comments on Hypothesis Testing. How do we decide what the hypotheses should be? What assumptions do we make about the sample? What if the assumptions we make do not apply to the actual sampling process? How do we select a test statistic? How do we define an acceptance region and a rejection region when using the significance level approach? Why do we consider values of the test statistic more extreme than the one actually observed when calculating a p-value? When is a p-value small and when is it large? Should we use the significance level or the p-value approach to hypothesis testing? How do we evaluate the power of a test?

Comments on Experimental Design. Experimental design: the area of statistics concerned with designing an investigation to best meet the study goals, as well as the assumptions for statistical inference. See P300 of the book.

Chapter 10 Large-sample inference about a population mean Large-sample inference about a proportion t-test and confidence interval on a t distribution(small sample)

Example 10.1(large sample population mean) Example 10.2(large sample proportion) Example 10.3(t distribution)
