Introduction to Applied Statistics Xiaobo Sheng

Overview
CH 1 Introduction
CH 2-3 Concepts, Descriptive Statistics of one variable
CH 6-8 Probability, A few common probability distributions and models
CH 9-13 Statistical Inference
CH 15 Linear Regression

Introduction
What is statistics?

A collection of numerical information

Or the branch of mathematics dealing with the theory and techniques of collecting, organizing, and interpreting numerical information. (We will focus on the first definition)

Why do we need statistics?
Pepsi vs Coke
Horse Racing
Casino Game

How do we deal with statistics?
Input: Data set (a collection of information)
Process: Data analysis (making sense of a data set)
Output: Statistical inference (drawing conclusions about a population based on a sample from that population)

Barney  Ted  Lily  Robin  Marshall
A-      B+   A     C      F

A few basic definitions you need to know
Population: the group or collection of interest to us. Usually it is very large and messy.
Sample: a subset of the population, reasonably small and capable of being analyzed using statistical tools. We use the observations in the sample to learn about the population.
Example: income of teachers, average age, etc.
Descriptive statistic: a number used to summarize information in a set of data values. Which one to use varies by problem.
Variable: a particular piece of information. Two types:
  quantitative variable: has numerical values that are measurements
  categorical variable: values cannot be interpreted as numbers

1st quartile (25th percentile): at least three-fourths of the values are greater than or equal to the first quartile
3rd quartile (75th percentile): at least three-fourths of the values are less than or equal to the third quartile
Page 49

Range: difference between the largest and smallest values of a data set.
Interquartile range: difference between the 3rd and 1st quartiles.
Standard deviation: used to measure the variation of values about the mean.
  σ: population standard deviation
  s: sample standard deviation
P82

Lists, Tables, and Plots
Data list: a listing of the values of a variable in a data set.
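The range, interquartile range, and the two standard deviations defined above can be computed with a short sketch like the following. The data values are hypothetical, and the quartile rule used by Python's `statistics.quantiles` (its default "exclusive" method) is an assumption; textbooks and other tools may place the quartiles slightly differently.

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles via the default "exclusive" method of statistics.quantiles.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Population standard deviation divides by n; sample version divides by n - 1.
sigma = statistics.pstdev(data)
s = statistics.stdev(data)

print("range:", data_range)
print("IQR:", iqr)
print("population sd:", sigma)
print("sample sd:", round(s, 3))
```

The sample standard deviation comes out slightly larger than the population one because it divides by n - 1 instead of n.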

Table

Table: Usually the values in a table are ordered or sorted by some criterion. If not, we can use Excel to sort them.

Plots
Dot Plot

Frequency Table

Histogram

Distribution
A description of how the values of the variable are positioned along an axis or number line.
Symmetric
Skewed to the left (negatively skewed): there is a concentration of relatively large values, with some scatter over a range of smaller values.
Skewed to the right (positively skewed): there is a concentration of relatively small values, with some scatter over a range of larger values.

Peak A major concentration of values.

Unimodal distribution: has one major peak
Bimodal: has two major peaks
Multimodal: has several major peaks

Box plot

Box graph

CH4
Scatterplot: a two-dimensional graphical display of two quantitative variables.

Transformation of a variable: a mathematical manipulation of each value of the variable.
  logarithmic transformation (a common one)
  square root transformation
  power transformation

Logarithmic transformation: take the logarithm of each value of the variable.

Further analysis of relationships between variables is in Ch. 15.

Homework

Ch 15 Correlation, Regression
Study the relationship between quantitative variables

Linear Correlation Coefficient

Mathematical Notation

(1)

Another form (2)

Formal Definition
Correlation coefficient (Pearson's correlation coefficient): a measure of linear association between two quantitative variables.
r has no unit and takes values from -1 to 1.
A correlation coefficient near 0 suggests there is little or no linear association between the two variables.
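The formulas labeled (1) and (2) are not reproduced in this text. As a sketch only, here are two commonly used and algebraically equivalent ways to compute Pearson's r: an average of products of standardized values, and a ratio built from raw sums. Whether these match the slide's two forms is an assumption, and the function names and data are hypothetical.

```python
import math

def pearson_r_standardized(xs, ys):
    """r as the sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def pearson_r_sums(xs, ys):
    """r from raw sums: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r1 = pearson_r_standardized(xs, ys)
r2 = pearson_r_sums(xs, ys)
print(round(r1, 4), round(r2, 4))
```

Both functions return the same value up to floating-point rounding, and r has no unit because each variable is divided by its own standard deviation.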

Example

What exactly does the correlation coefficient measure?
It measures the extent of clustering of plotted points about a straight line.
A correlation coefficient that is large in absolute value suggests strong linear association between the two variables.
A correlation coefficient that is near zero suggests little linear association between the two variables.

Can the correlation coefficient be misleading?
Yes. We should always plot two quantitative variables to get a visual feel for their relationship. Then we can use the correlation coefficient to supplement the plot.

r is 0.66. By itself, this correlation coefficient might suggest linear association between these two variables. But the figure itself suggests a curved relationship. A stronger linear relationship exists between life expectancy and the logarithm of per capita gross national product (r = 0.84).

Outlier
An observation that is far from the other observations.

Simple Linear Regression
Method of least squares

Example

Scatterplot

Calculation table

Scatterplot with least-squares line

Intercept has no physical meaning here.
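A minimal sketch of the method of least squares for a line y = b0 + b1x, using the usual closed-form solution b1 = Sxy / Sxx and b0 = ybar - b1 * xbar. The data below are hypothetical and are not the data from the example slides; the function name is also an assumption.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
print("intercept b0:", round(b0, 3))
print("slope b1:", round(b1, 3))
print("prediction at x = 6:", round(b0 + b1 * 6, 3))
```

The fitted line always passes through the point of means, which is why the intercept follows directly once the slope is known.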

Definition of Linear Regression
Simple linear regression refers to fitting a straight-line model by the method of least squares and then assessing the model.
Application:
Find the relationship between two quantitative variables.
Can be used to predict future values.

Standard deviation line

What's the relationship between those two lines? Explain why.

Homework
1. Prove that the two forms (1) and (2) are the same.
2. Prove that r is always between -1 and 1.
3. Show the details of calculating b0 and b1.
4. After standardizing a variable, why does the variance become 1?

Project Report Requirement
1. Details of the experiment you did (including environment, time, process, and the variables you want to measure), or the source of the data you picked (either online or somewhere else). Describe what those numbers stand for and which two variables you are measuring.
2. Data set (variable name, unit, trials)
3. Scatter plot (including axis labels and the unit of each variable)
4. Linear regression line (y = b0 + b1x)
5. Detailed calculation of b0 and b1 for your data set
6. Correlation coefficient
7. Summary
(Make it 25 pages)

Ch6 Probability
50% chance of shower
1/2 chance to get head, 1/2 tail
Celtics vs Lakers: 60% Celtics will win.
The probability of an event is the chance or likelihood of the event occurring.

Experiment, Outcome, Sample Space, Events
Experiment: a process leading to a well-defined observation or outcome. Ex: roll a die, toss a coin
Sample space: the set of all possible outcomes of the experiment. Ex: Head/Tail; 1,2,3,4,5,6
Finite sample space: a sample space that contains a finite number of outcomes (Ex: rolling dice)
Continuous sample space: a sample space that equals an interval of values. Ex: class length [0, 75], height
Discrete sample space: a sample space that contains separate (discontinuous) values. Ex: grades A,B,C,D; age; # of Facebook friends

More Examples
Measure the annual snowfall.
A researcher carries out an experiment as part of a study of cigarette smoking and lung cancer. She selects a male smoker at random from among all male smokers, then keeps in touch with him until he either develops lung cancer or dies with no evidence of lung cancer.
Outcomes
Discrete sample space: either with cancer or without cancer
Continuous sample space: assuming he develops cancer, what is the time until he gets the disease?

Difference between continuous and discrete sample spaces
In a discrete sample space, it is usually easy to calculate the probability of a given outcome.
In a continuous sample space, probabilities may depend on the distribution function.

Event
A subset of the sample space.
Examples:
Roll two dice; the sum of the two numbers is greater than 7.
The class lasts from 49 to 51 minutes.
The grade is B or above.

Probability Function
Assigns a unique number, or probability, to each outcome in a finite sample space S.
A probability is always greater than or equal to 0 and less than or equal to 1.
P(E) denotes the probability of an event E.
P(entire sample space) = ?
P(not E) = 1 - P(E)

Examples
Toss a coin three times

P(at least two heads) = ?
P(at most two tails) = ?
P(at least one tail) = ?

Comments:
Independent events: if the outcome of A has no effect on the possible outcome of B, and the outcome of B has no effect on the possible outcome of A, then we say that A and B are independent.

P(A and B) = P(A) * P(B)
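For a small finite sample space, event probabilities can be found by listing every outcome and counting. The sketch below does this for three tosses of a coin, assuming the coin is fair so that all 8 outcomes are equally likely; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: every sequence of three tosses, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event, given as a true/false test on an outcome."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_at_least_two_heads = prob(lambda o: o.count("H") >= 2)
p_at_most_two_tails = prob(lambda o: o.count("T") <= 2)
p_at_least_one_tail = prob(lambda o: o.count("T") >= 1)

print(p_at_least_two_heads, p_at_most_two_tails, p_at_least_one_tail)

# Complement rule check: P(at least one tail) = 1 - P(no tails).
p_no_tails = prob(lambda o: o.count("T") == 0)
print(1 - p_no_tails == p_at_least_one_tail)
```

Counting outcomes only works when they are equally likely; with a biased coin each outcome would need its own probability, obtained by multiplying the per-toss probabilities of independent tosses.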

Homework
Suppose the probability of a head is 75% and of a tail is 25%, and the tosses are independent.
P(at least two heads) = ?
P(at most two tails) = ?
P(at least one tail) = ?

Conditional Probability
The conditional probability of event A given event B, denoted P(A|B), is the probability that events A and B occur together, divided by the probability of event B:

P(A|B) = P(A and B) / P(B)

Dependent events

Ex:

Outcomes:
prisoner black, victim black: P1 =
prisoner black, victim white: P2 =
prisoner white, victim black: P3 =
prisoner white, victim white: P4 =
Page 186

Denote
B = {(prisoner white, victim black), (prisoner white, victim white)}
C = {(prisoner black, victim white), (prisoner white, victim white)}
P(B and C) = ?
P(B|C) = ?

Independent Events
Two events are independent if knowing that one of the events occurred does not change the calculated probability that the other event occurred.

This formula can be extended. P(A and B and C) = P(A)*P(B)*P(C) if A and B and C are independent.

Example: Toss coins
P(three heads in a row) = ?
P(two heads) = ?
P(one tail) = ?

Bayes' Rule
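The statement of the rule is not reproduced in this text. As a hedged sketch, the standard form P(A|T) = P(T|A)P(A) / [P(T|A)P(A) + P(T|not A)P(not A)] can be applied to a diagnostic test, where A is "has the condition" and T is "tests positive". The prevalence, sensitivity, and specificity values below are hypothetical, and so is the function name.

```python
from fractions import Fraction

def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(condition | positive test) by Bayes' rule.

    prevalence  = P(A)
    sensitivity = P(T | A)
    specificity = P(not T | not A), so P(T | not A) = 1 - specificity
    """
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)

# Hypothetical test characteristics, chosen only for illustration.
posterior = posterior_given_positive(
    prevalence=Fraction(1, 100),
    sensitivity=Fraction(9, 10),
    specificity=Fraction(95, 100),
)
print(posterior, float(posterior))
```

Even with a fairly accurate test, the posterior probability stays low when the condition is rare, because false positives from the large healthy group outnumber true positives.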

Sensitivity of a diagnostic test is the probability that a person with the condition under study will test positive.
Specificity of a diagnostic test is the probability that a person without the condition under study will test negative.
Example 6-11, Page 189
Two different approaches
1. Bayes's Rule
2. Tree diagram

Random Variable
A rule that assigns a number to each outcome in the sample space.
A finite random variable is a random variable that takes on a finite number of values.
A continuous random variable is a random variable that takes on values in an interval of numbers.

Examples, p192-193
Apply for a job and either get an offer or not. Sample space = {success, failure}. Assign X(success) = 1, X(failure) = 0.
Letter grade: sample space S = {A, B, C, D, F}. Assign a number (GPA) to each letter grade: G(A) = 4, G(B) = 3, G(C) = 2, G(D) = 1, G(F) = 0.
Select one freshman at random from among all freshmen at University. Each freshman represents a possible outcome of the experiment; the sample space consists of all freshmen at University. Record height, weight, and sex.
X assigns to each student the number corresponding to height.
Y assigns to each student the number corresponding to weight.
Z codes the sex of the student; we can assign Z = 1 if female, 2 if male.
Based on the assignment, we can write probabilities: P(X > 54), P(120 ≤ Y ≤ 150), P(Z = 2)

Probability Distribution
The probability distribution of a random variable is the collection of probabilities assigned to events defined by the random variable. Details in Chapter 8.

Mean, Variance and Standard Deviation of a Finite Random Variable
The mean (or expected value) of a finite random variable X equals the sum of x * P(X = x) over all possible values x of X.

Denote as E(X), μ

Variance

Denote as Var(X), σ²

Example: P195-196

Three statistics students volunteer for a taste test comparing Coke and Pepsi. Each student tastes samples in two identical-looking cups and decides which beverage he or she prefers. Suppose the students make selections independently of one another. Suppose also that the probability of picking Pepsi is 3/5 and Coke 2/5 for all three students.
Random variable Y: number of Pepsi selections.
Possible values of Y: 0, 1, 2, 3
Questions:
Expected value of Y?
E(Y) = 3*P(Y=3) + 2*P(Y=2) + 1*P(Y=1) + 0*P(Y=0)
P(Y=3) = 27/125
P(Y=2) = 54/125
P(Y=1) = 36/125
P(Y=0) = 8/125
E(Y) = 225/125 = 1.8

What's the variance of Y?
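The expected value above, and the variance asked for, can be checked with a short sketch. It assumes the usual definition Var(Y) = sum of (y - E(Y))^2 * P(Y = y), since the slide's variance formula is not reproduced in this text, and it builds the probabilities from the stated 3/5 chance of picking Pepsi for each of three independent students.

```python
from fractions import Fraction
from math import comb

p_pepsi = Fraction(3, 5)   # probability that one student picks Pepsi
n_students = 3

# P(Y = k): k of the three independent students pick Pepsi.
pmf = {
    k: comb(n_students, k) * p_pepsi ** k * (1 - p_pepsi) ** (n_students - k)
    for k in range(n_students + 1)
}

# Mean: each value weighted by its probability.
mean = sum(k * pk for k, pk in pmf.items())

# Variance: squared distance from the mean, weighted by probability.
variance = sum((k - mean) ** 2 * pk for k, pk in pmf.items())

for k, pk in pmf.items():
    print(f"P(Y={k}) = {pk}")
print("E(Y) =", mean, "=", float(mean))
print("Var(Y) =", variance, "=", float(variance))
```

Using `Fraction` keeps the arithmetic exact, so the probabilities print as the same fractions that appear on the slide.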

Homework
P(A) = 0.008
P(T|A) = 0.85
P(not T|not A) = 0.90
Find P(A|T) using two different methods. Verify your answers.

Chapter 7
Permutations, combinations
Binomial distribution
Hypergeometric distribution

Permutation
An ordered arrangement of a finite number of items.
Example:
Suppose you are playing FIFA World Cup, and you are in Group A with three other teams: Brazil, Italy, Spain.
How many ways are there to rank those four teams?
1st: You have 4 choices.
2nd: After you pick one, you have 3 choices.
3rd: Now you have only 2 choices.
4th: After you decide the first 3, the one left is your only choice.
In total: 4*3*2*1 = 4! = 24
Generally, if you have n objects to arrange in order, there are n! ways to do it.

A different permutation example:
Lottery: 5 boxes. In each box, there are 10 balls numbered from 0 to 9.
To win: the lottery ticket you bought has to have exactly the same numbers in exactly the same order as the winning numbers.
Permutations: 10^5
Probability of winning: 1/10^5

Combination
A group of objects selected from a larger collection without regard to order of selection.
Example:
The previous FIFA example. Suppose we divide the 4 teams in Group A into 2 sets. (One set is the teams ranked 1st and 2nd, who go to the quarterfinal. The other set is the teams ranked 3rd and 4th, who have to go home.)
Suppose we only care about which teams are in which set, and we do not care about the order of the 2 teams within the same set.
How many possible ways are there to arrange this?
4!/(2!2!) = 6
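The counts in the permutation and combination examples can be checked both by formula and by listing the arrangements. This is a minimal sketch; the team label "Your team" is a hypothetical stand-in for the fourth team in the group.

```python
import math
from itertools import combinations, permutations

teams = ["Your team", "Brazil", "Italy", "Spain"]

# Ordered rankings of 4 teams: 4! by formula, and by explicit listing.
rankings_by_formula = math.factorial(len(teams))
rankings_by_listing = len(list(permutations(teams)))

# Lottery: 5 boxes with 10 balls each, order matters, digits may repeat.
lottery_outcomes = 10 ** 5
p_win = 1 / lottery_outcomes

# Which 2 of the 4 teams advance, ignoring order within each set.
advancing_by_formula = math.comb(4, 2)
advancing_by_factorials = math.factorial(4) // (math.factorial(2) * math.factorial(2))
advancing_by_listing = len(list(combinations(teams, 2)))

print(rankings_by_formula, rankings_by_listing)
print(lottery_outcomes, p_win)
print(advancing_by_formula, advancing_by_factorials, advancing_by_listing)
```

Choosing the two advancing teams fixes the two eliminated teams as well, which is why a single combination count covers both sets.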

Examples
Celtics and Lakers meet in the championship final.
In how many ways can the Celtics win 2 of the first 4 games?
In how many ways can the Celtics win the series in exactly 6 games?

Binomial Distribution
Prerequisite: Bernoulli experiment (Bernoulli trial)
An experiment that has exactly two possible outcomes.
Example
Fair die, roll four times.
Win $1 for each result that is divisible by 3.
Probability of winning at least $2?
Suppose random variable Y equals the number of results that are divisible by 3 in four rolls. What's the expected gain?
If you pay $1 to play, do you expect to come out ahead or not?

Binomial experiment
Consists of n independent repetitions of a Bernoulli experiment (only two possible outcomes).
The probability of success is the same for each repetition.

Random variable X: counts the number of successes in n repetitions.
We say X has a binomial distribution, denoted B(n, p).
n: number of repetitions
p: probability of success on each repetition
Question
Straightforward approach:

What if n is very large?
Is there any other way to do this?

How?

Expected Value, Variance

A few properties of expected values and variance
E(X+Y) = E(X) + E(Y)
E(cX) = cE(X)
E(X-Y) = E(X) - E(Y)
E(X+c) = E(X) + c

Var(X+Y) = Var(X) + Var(Y) (for independent X and Y)
Var(X-Y) = Var(X) + Var(Y) (for independent X and Y)
Var(cX) = c^2 Var(X)

Example
Suppose I am on a vacation (that would be awesome!), and I plan to visit 12 countries. The probability that I like a country is 2/3 (for all 12 of them).
What is the probability that I like more than 9 of them?
This time the straightforward approach (listing all possible outcomes) is practically impossible.
Use the binomial distribution.

Hypergeometric Distribution
Suppose RV X counts the number of Type 1 objects in a sample selected at random from a finite collection of objects, each classified as either Type 1 or Type 2. Then we say X has a hypergeometric probability distribution, and we call X a hypergeometric random variable.
General Problem
Suppose that in a group of N objects, m1 are Type 1 and m2 are Type 2. Select a sample of n at random from the N objects. Random variable X counts the number of Type 1 objects in the sample.
What's the probability that k of the n selected objects are Type 1?
P(X = k)
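Both questions above can be sketched with the standard probability formulas, which are not reproduced in this text: the binomial P(X = k) = C(n, k) p^k (1 - p)^(n - k) and the hypergeometric P(X = k) = C(m1, k) C(m2, n - k) / C(N, n) with N = m1 + m2. The vacation calculation assumes that liking one country is independent of liking another, and the hypergeometric numbers (4 Type 1, 6 Type 2, sample of 3) are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X with a binomial distribution B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def hypergeometric_pmf(k, m1, m2, n):
    """P(X = k): k Type 1 objects in a sample of n drawn without
    replacement from m1 Type 1 and m2 Type 2 objects."""
    return Fraction(comb(m1, k) * comb(m2, n - k), comb(m1 + m2, n))

# Vacation example: 12 countries, probability 2/3 of liking each one.
p_like = Fraction(2, 3)
p_more_than_9 = sum(binomial_pmf(k, 12, p_like) for k in (10, 11, 12))
print("P(X > 9) =", p_more_than_9, "=", round(float(p_more_than_9), 4))

# Hypothetical hypergeometric setup: 4 Type 1, 6 Type 2, sample of 3.
p_one_type1 = hypergeometric_pmf(1, m1=4, m2=6, n=3)
total = sum(hypergeometric_pmf(k, 4, 6, 3) for k in range(4))
print("P(X = 1) =", p_one_type1, "and all cases sum to", total)
```

The binomial applies when each trial has the same success probability (sampling with replacement or independent trials); the hypergeometric applies when sampling without replacement from a finite group.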

Ch8 Gaussian (Normal) Distribution
Gaussian (Normal) Distribution
Standard Normal Distribution
Central Limit Theorem

Normal probability function curve vs Standard normal probability function curve

The standard normal distribution has a mean of 0 and a standard deviation of 1.
The area under the curve is 1!
Now the probability of random variable Z can be represented by the area!

Cumulative probability
Has the form P(X ≤ c), where X is a random variable and c is a constant.
Tail probability
A probability that is small (less than 0.5) and has the form P(X ≤ c) or P(X ≥ c) for some number c.
Do some problems
P(Z ≤ 1)
P(Z ≤ 2)
P(-1 ≤ Z ≤ 2)

Standardize

Suppose we have a random variable X with mean 3 and variance 4.
P(1 ≤ X ≤ 5) = ?
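A sketch of the standardization step for this problem, assuming X is Gaussian (the slide gives only the mean and variance, so the normality is an assumption taken from the chapter context). Standardizing means computing Z = (X - mean) / standard deviation and then reading the area under the standard normal curve.

```python
from statistics import NormalDist

mu = 3
variance = 4
sigma = variance ** 0.5   # standard deviation is the square root of the variance

# Standardize both endpoints: z = (x - mu) / sigma.
z_low = (1 - mu) / sigma
z_high = (5 - mu) / sigma

# Area under the standard normal curve between the two z values.
standard_normal = NormalDist()   # mean 0, standard deviation 1
p_via_z = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# Same probability computed directly on the original scale.
x_dist = NormalDist(mu, sigma)
p_direct = x_dist.cdf(5) - x_dist.cdf(1)

print("z range:", z_low, "to", z_high)
print("P(1 <= X <= 5) =", round(p_via_z, 4))
```

The interval from 1 to 5 is exactly one standard deviation on either side of the mean, so the answer is the familiar "about 68%" area.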

Approximating Normal Distribution
A distribution of data values is approximately Gaussian if the proportion of values in any interval approximately equals the area over that interval under the appropriate Gaussian curve.
The distribution of a random variable is approximately Gaussian if the probability that the random variable is in any interval approximately equals the area over that interval under a Gaussian curve.

Central Limit Theorem
Background
Random sample: a collection of independent random variables with the same probability distribution.
Central Limit Theorem

How large does the sample size have to be?
It depends on how different the probability distribution of X is from a Gaussian distribution.
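The statement of the theorem is not reproduced in this text. In its usual form it says that the mean of a large random sample is approximately Gaussian, with the same mean as X and with standard deviation equal to that of X divided by the square root of the sample size. The simulation below is a sketch of that idea; the uniform distribution, the sample size of 30, and the number of repetitions are all arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(0)            # fixed seed so the run is repeatable

sample_size = 30          # size of each random sample
repetitions = 5000        # how many sample means to collect

# Each sample is drawn from a uniform distribution on [0, 1),
# which is flat and clearly not Gaussian.
sample_means = [
    sum(random.random() for _ in range(sample_size)) / sample_size
    for _ in range(repetitions)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory for uniform [0, 1): mean 1/2, standard deviation sqrt(1/12).
predicted_sd = math.sqrt(1 / 12) / math.sqrt(sample_size)

# For a Gaussian, roughly 68% of values lie within one sd of the mean.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / repetitions

print("mean of sample means:", round(mean_of_means, 4))
print("sd of sample means:", round(sd_of_means, 4),
      "predicted:", round(predicted_sd, 4))
print("share within one sd:", round(within_one_sd, 3))
```

Repeating the experiment with a more skewed starting distribution would typically require a larger sample size before the sample means look Gaussian.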

Large-sample result related to the CLT

Chapter 9
Basic ideas in Statistics
Hypothesis testing

Definitions
Statistical inference: the process of drawing conclusions about a population based on a sample from that population.
Population:
  Hypothetical
  Has substance
Sample:
  Random sample in the probability sense
  Random sample in the experimental sense
Statistical inference:
  Parametric
  Nonparametric
Parameter:
  Mean, median
  Standard deviation, variance
Parameter estimate types (used to estimate a population parameter):
  Point estimate
  Interval estimate
Interval estimate
Confidence interval: an interval estimate of a parameter, with a probability interpretation.

Hypothesis testing
A formal strategy for comparing two statements about the state of nature in an experimental situation.
  Null hypothesis
  Alternative hypothesis
Null hypothesis: a statement about the state of nature in an experimental situation. Generally a "no difference" statement. Denoted H0.
Alternative hypothesis: a statement about the state of nature, providing an alternative to that specified in the null hypothesis. Denoted Ha or H1.

Example 1
State a specific goal. Write this goal in terms of hypotheses to be compared.
Design an experiment to meet this goal.
State assumptions and describe how the experiment will be analyzed. (Table 9.1)
Carry out the experiment.
Analyze and interpret the results.

Example 2

Hypothesis Testing General Strategy
Significance Level Approach
Test statistic: a measure of how much the sample observations differ from what we would expect if the null hypothesis were true.
Significance level (Error: Type 1, Type 2, Power of test): the probability of saying the observations are inconsistent with the null hypothesis when the null hypothesis is really true.
p-value Approach
p-value: the probability of seeing a test statistic as extreme as or more extreme (in the direction of the alternative) than the one observed, if the null hypothesis were really true.
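The two approaches can be compared on a small made-up case. This sketch assumes a large-sample test of a population mean in which the test statistic z = (sample mean - hypothesized mean) / (sd / sqrt(n)) is approximately standard normal when the null hypothesis is true, and a one-sided alternative that the mean is larger. All numbers, including the 0.05 significance level, are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical setup: H0 says the mean is 100, Ha says it is greater than 100.
hypothesized_mean = 100
sample_mean = 103
sample_sd = 15
n = 100
significance_level = 0.05

# Test statistic: how far the sample mean is from the H0 value,
# measured in standard errors.
standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - hypothesized_mean) / standard_error

# p-value: probability of a z at least this large if H0 were true
# (upper tail only, because the alternative is "greater than").
p_value = 1 - NormalDist().cdf(z)

# Significance level approach: reject H0 when z exceeds the cutoff.
critical_value = NormalDist().inv_cdf(1 - significance_level)
reject_by_cutoff = z > critical_value

# p-value approach: reject H0 when the p-value is below the level.
reject_by_p_value = p_value < significance_level

print("z =", round(z, 3), "critical value =", round(critical_value, 3))
print("p-value =", round(p_value, 4))
print("reject H0:", reject_by_cutoff, reject_by_p_value)
```

For a one-sided test like this one, the two approaches lead to the same decision, since z exceeds the cutoff exactly when the upper-tail area beyond z is smaller than the significance level.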

Comments on Hypothesis Testing
How do we decide what the hypotheses should be?
What assumptions do we make about the sample?
What if the assumptions we make do not apply to the actual sampling process?
How do we select a test statistic?
How do we define an acceptance region and a rejection region when using the significance level approach?
Why do we consider values of the test statistic more extreme than the one actually observed when calculating the p-value?
When is a p-value small and when is it large?
Should we use the significance level or the p-value approach to hypothesis testing?
How do we evaluate the power of a test?

Comments on Experimental Design
Experimental design: the area of statistics concerned with designing an investigation to best meet the study goals, as well as the assumptions for statistical inference. See book, P300.

Chapter 10
Large-sample inference about a population mean
Large-sample inference about a proportion
t-test and confidence interval based on a t distribution (small sample)

Example 10.1 (large sample population mean)
Example 10.2 (large sample proportion)
Example 10.3 (t distribution)