
  • Statistical Inference

    Notes of

    David Casado de Lucas

    You can decide not to print this file and consult it in digital format – paper and ink will be saved. Otherwise, print it on recycled paper, double-sided and with less ink. Be ecological. Thank you very much.

    3 December 2017

    http://www.Casado-D.org/edu/teaching.html
    http://www.Casado-D.org/edu/index.html

  • Contents Inference Theory Point Estimations Confidence Intervals Hypothesis Tests Appendixes More On Statistics Exercises and Problems

    Additional Theory

    Statistical Kitchen

    Further Readings

    Probability Theory Mathematics

    Tables of Estimators and Statistics

    Names of sections are usually links too.

    To use these textboxes you must overwrite the file or save it with a different name.

    Errata and linguistic errors are corrected as soon as possible. You may want to update (download and overwrite) the version of this file that you have.

    Links to the beginning of the document

    and the chapter, respectively.


    http://www.casado-d.org/edu/NotesStatisticalInference-Slides.pdf

  • This file contains the slides that I am writing for my students. I try to consider the pieces of advice included in:

    http://www.Casado-D.org/edu/GuideForStudents-Slides.pdf

    Solved exercises and problems are available at:

    http://www.Casado-D.org/edu/ExercisesProblemsStatisticalInference.pdf

    Prologue

    This document has been created with Linux, LibreOffice, OpenOffice, GIMP and R. (They allow me to work with my old computers.) I thank those who make this software available for free. I donate funds to these kinds of project from time to time.

    Acknowledgements

    3


  • Motivation

    One of two ways in which students can use this new book is as a supplementary text in a course that demands some statistical thinking but does not focus on statistics. The other use is as a self-teaching preparation for a course that does focus on statistics. It has been my observation, and that of my colleagues, that it is possible for a student to complete such a course without ever really thinking about statistics. Many students learn to do the required calculations but have only the foggiest conception of what the calculations mean.

    [...]

    You may be planning to study statistics not because you want to but because you have to. If so, I know how you feel. I went through the same experience years ago; if I could have avoided statistics, I probably would have. However, my attitude changed after I began to study it, for I discovered in it a new way of thinking that was truly fascinating.

    But your present task may be even more challenging than mine was. You won't have to do the computations that I did, but you are about to acquire within a very short time (and possibly by yourself) the same grasp of the underlying structure of statistics that I acquired in two full semesters under an excellent teacher.

    (From: How to Think about Statistics. Phillips, J.L. W.H. Freeman and Company.)

    4

  • Motivation

    Something [...] did happen with the draft lottery in the United States in 1970. People were assigned draft numbers on the basis of their birth dates, with a low number indicating a greater chance of being inducted. The 366 dates were put into capsules, mixed, and drawn and assigned lottery numbers 1, 2, etc. Apparently, the capsules were not mixed very well – people born in December had lottery numbers that averaged 121.5, which is pretty far away from the average of the numbers 1-366, namely 183.5. Steps were taken with the 1971 draft lottery to make the results more random by drawing both the date and the lottery number from drums, after mixing them more thoroughly.

    [...]

    A good example of a nonrandom sample was the 1936 Literary Digest presidential election poll. The Literary Digest had 2 million people respond to its poll, which is a much larger number than would have been needed to get an accurate result if the sample had been selected randomly. However, the poll predicted that Alfred Landon would be an easy winner, whereas in fact Franklin D. Roosevelt won by a landslide. The problem was that the Digest sample was not a random sample. The magazine mailed out cards to people whose names were obtained from telephone lists and other sources, but the people who had telephones at that time were not representative of the population as a whole. If a sample is not selected randomly, there is no way to estimate how far off it might be.

    (From: Business Statistics. Downing, D., and J. Clark. Barron's.)

    5

  • Suppose you are taking a 20-question multiple-choice exam. Each question has four possible answers, so the probability is .25 that you can answer a question correctly by guessing. What is the probability that you can get at least 10 questions right by pure guessing?

    (From: Business Statistics. Downing, D., and J. Clark. Barron's Educational Series.)
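A quick numerical sketch for this exam-guessing question (my own illustration, not part of the original slides): the number of correct guesses follows a binomial distribution with n = 20 and p = 0.25, and the question asks for the upper-tail probability P(X ≥ 10).

```python
from math import comb

# Number of correct guesses X ~ Bin(n = 20, p = 0.25).
n, p = 20, 0.25

# P(X >= 10): sum the binomial probabilities over the upper tail.
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(10, n + 1))
print(round(prob, 4))  # roughly 0.0139
```

The chance of passing by pure guessing is thus well under 2%.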

    A sporting goods store operates in a medium-sized shopping mall. In order to plan staffing levels, the manager has asked for your assistance to determine if there is strong evidence that Monday sales are higher than Saturday sales.

    (From: Statistics for Business and Economics. Newbold, P., W. Carlson and B. Thorne. Pearson-Prentice Hall.)
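One way such a comparison could start (a sketch with hypothetical sales figures, not from the book): compute the two-sample t statistic for H0: the Monday and Saturday means are equal, against the one-sided alternative that Monday sales are higher. A large positive value is evidence for higher Monday sales.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical daily sales figures, for illustration only.
monday = [220, 240, 230, 250, 260]
saturday = [200, 210, 190, 220, 205]

# Two-sample t statistic (unequal-variances, Welch form).
n1, n2 = len(monday), len(saturday)
t = (mean(monday) - mean(saturday)) / sqrt(variance(monday) / n1 + variance(saturday) / n2)
print(round(t, 2))  # compare with a critical value from the t distribution
```

The decision rule itself (critical value or p-value from the t distribution) is developed in the Hypothesis Tests chapter.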

    Motivation 6

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    7

    Syllabus

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    What is the probability that something occurs? How do we calculate the average or the spread of a quantity? ...

    Main concepts of Probability: random experiment, random variable, probability function, distribution function, mean, variance, etc. Main discrete and continuous models: Bernoulli, binomial, Poisson, uniform, normal, etc. New models of probability distributions: χ², t and F. Calculation of probabilities and quantiles.

    8

    Syllabus

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    How should we study a characteristic of a population? How can we take a representative collection of data? How can we infer population characteristics from sample information? ...

    Main concepts of Statistical Inference: population, sample, types of sampling, statistics, sampling distribution, etc.

    9

    Syllabus

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    How can the real value of a population measure or parameter be approximated? Which estimator should we consider? How can the quality of an estimator be measured? How does an estimator behave when the amount of information increases? ...

    Two methods to find estimators of any parameter θ. Properties and quality of all these estimators. Some well-known estimators of the population measures μ and σ².

    10

    Syllabus

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    Why should we base an estimation on a unique numerical value (and its standard deviation)? What is the level of certainty of this value? How can we select an interval of values around the unknown population value? What is the level of certainty of an interval? What is the maximum error (in probability) of an interval? Given the maximum error, what is the minimum number of data necessary to guarantee it?

    Main statistics to study the population measures μ and σ².

    Method of the pivotal quantity to construct confidence intervals. Confidence. Margin of error. Minimum sample size.

    11

    Syllabus

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    Can a population measure be considered bigger than five (e.g.)? Is a variable distributed with the same spread in two different populations? Are two populations really different? Should a probability distribution be used to represent the variable of an experiment? ...

    Main concepts to test hypotheses: types of hypotheses, type I and type II errors, methodologies, certainty of the decision, significance, etc. Parametric tests: questions based on the population measures or a parameter. Analysis of variance: comparison of the means of several populations. Nonparametric tests: questions based on general characteristics of the population.

    12

    Syllabus

  • There are three main chapters and some additional ones. The last chapters may be read first when preparing the subject, since they are tools.

    Within each chapter, there is a main body of slides with the basic contents, plus some appendixes with complementary ideas that students should or can use according to their background and interests.

    Theory is difficult to understand without solving exercises (and quality is more important than quantity). A document with dozens of solved exercises is available. There are many proposed exercises, which can be used for self-evaluation, with their solutions at the end of the chapters.

    The slides with the practicals are also at the end of each chapter.

    A good way of learning Statistical Inference may be based on the alternating use of the textbook, these slides (which may help to organize concepts and ideas, since they are intended not only for lectures but also for students to read autonomously) and the exercises.

    How to Use These Slides 13

  • Symbols – How to Use These Slides

    Some slides contain or are marked with one of the following symbols, which mean:

    It is useful for you to get a general view of the document

    This result plays a role, although we do not use it directly but through an easier form.

    This slide mentions some further readings you may consider.

    This formula should be looked at carefully, to understand it thoroughly.

    This slide or section contains steps that may be useful for beginners.

    This slide contains tricky uses of Statistics you should be aware of.

    14

  • [1] Downing, D., and J. Clark. Business Statistics. Barron's Educational Series.

    [2] Frank, H., and S.C. Althoen. Statistics: Concepts and Applications. Cambridge University Press.

    [3] Grimmett, G., and D. Stirzaker. Probability and Random Processes. Oxford University Press.

    [4] Mendenhall, W., D.D. Wackerly and R.L. Scheaffer. Mathematical Statistics with Applications. Duxbury Press.

    [5] Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.

    [6] Pérez, C. Técnicas de muestreo estadístico. Garceta.

    [7] Serfling, R.J. Approximation Theorems of Mathematical Statistics. John Wiley & Sons.

    [8] Wikipedia. http://www.wikipedia.org/

    References (I) Theory 15

  • http://www.picgifs.com

    http://actividades.parabebes.com

    The Cartoon Guide to Statistics. Larry Gonick and Woollcott Smith. Harper.

    http://blogs.20minutos.es/que-paso-en-el-mundial/2014/05/ (Photos: FIFA)

    http://all-free-download.com/

    (I have not been able to find other links, and the source is not always clear.)

    References (II) Symbols 16

  • [1] Downing, D., and J. Clark. Business Statistics. Barron's Educational Series.

    [2] Kazmier, L.J. Business Statistics. McGraw Hill.

    [3] Mann, P.S. Introductory Statistics. John Wiley & Sons, Inc.

    [4] Miller, I., and M. Miller. John E. Freund's Mathematical Statistics with Applications. Pearson.

    [5] Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.

    [6] Spiegel, M.R., and L.J. Stephens. Statistics. McGraw Hill.

    [7] Wikipedia. http://www.wikipedia.org/

    [8] The R Project for Statistical Computing. http://www.r-project.org/

    [9] Materials of my Department.

    References (III) Exercises and Problems 17

  • References (IV) My Documents 18

    [1] A Brief Guide for Students. http://www.Casado-D.org/edu/GuideForStudents-Slides.pdf

    [2] Notes of Probability Theory. http://www.Casado-D.org/edu/NotesProbabilityTheory-Slides.pdf

    [3] Notes of Statistical Inference. http://www.Casado-D.org/edu/NotesStatisticalInference-Slides.pdf

    [4] Solved Exercises and Problems of Statistical Inference. http://www.Casado-D.org/edu/ExercisesProblemsStatisticalInference.pdf

    [5] R Code Applied to Statistics. http://www.Casado-D.org/edu/CodeAppliedToStatistics-Slides.pdf


  • Inference Theory

    19

    Sections

  • Probability Theory – Review of concepts. Basic formulas. Some well-known distributions. Continuous probability models linked to the normal distribution: χ², t and F. Sums and sequences of independent random variables: theorems, modes of convergence, the central limit theorem.

    Inference Theory – Concept of sample. Types of sampling. Concepts of statistic and estimator. Sampling distribution. Main statistics and how they are used.

    Point Estimations – Estimation. Estimators of μ and σ²: sample mean, sample proportion, sample variance, difference of means and difference of proportions, ratio of variances. Methods to estimate θ: maximum likelihood method and method of moments. Properties of the estimators: unbiasedness, mean square error, efficiency, consistency.

    Confidence Intervals – Concept of confidence interval. Construction of confidence intervals: the method of the pivotal quantity. Main cases: mean, proportion, variance, difference of means, difference of proportions, quotient of variances. Margin of error. Minimum sample size.

    Hypothesis Tests – Types of statistical hypotheses. Type I and type II errors. Critical or rejection region. P-value. Parametric tests on the mean, proportion, variance, difference of means, difference of proportions or quotient of variances. Power function. Likelihood ratio tests. Analysis of variance (ANOVA). Nonparametric tests: goodness-of-fit, independence, homogeneity. Chi-square tests. Kolmogorov-Smirnov tests.

    Chapters 20

  • Explain some basic ideas and concepts of Statistics.

    Describe the sampling process – convenience or necessity.

    Present the main kinds of sampling, especially simple random sampling.

    Define the concepts of statistic and sampling distribution.

    Define the concepts of estimator, estimate and estimation.

    Present and motivate the statistics we work with.

    Use software to practice some of the concepts.

    Chapter Goals 21

  • Among the basic ways of selecting the elements of a sample, only simple random sampling will be considered. Why Statistics works is motivated through the convergence of the histogram to the probability function, or of the sample distribution function to its population counterpart (thanks to the laws of large numbers).
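This convergence can be watched directly by simulation (an illustrative sketch, not part of the original slides; the sample size and seed are arbitrary choices): for a fair die, the relative frequencies of the six faces approach the probability function 1/6 as the sample grows.

```python
import random

# Simulate n throws of a fair die and compare the histogram of relative
# frequencies with the probability function, which assigns 1/6 to each face.
random.seed(1)
n = 60_000
counts = {face: 0 for face in range(1, 7)}
for _ in range(n):
    counts[random.randint(1, 6)] += 1

rel_freqs = {face: c / n for face, c in counts.items()}
# By the law of large numbers, each relative frequency is close to 1/6.
print(rel_freqs)
```

Rerunning with a larger n should bring the frequencies even closer to 1/6.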

    The mathematical functions that use the sample to study the population are studied theoretically, for any possible sample, before being used with a particular sample. A few tables contain all the estimators and statistics necessary for our methods.

    For the population quantities of interest, we present the types of statistical problem and the cases that we deal with.

    Basic concepts: randomness, units of measurement, quantities of interest, population, sample, sampling, histograms, use of data, etc.

    Types of problem: point estimations, confidence intervals, hypothesis tests.

    Statistics and estimators: statistics, estimators, statistics made with estimators, sampling distribution, tables of statistics Ts.

    Cases.

    Statistical studies: steps, qualities, useful questions.

    Use of Ts: how T is usually used, notation, framework.

    Appendixes: practicals, guide for students, inference in other fields, etc.

    Advice is given to understand how the estimators and the statistics are used to solve the problems, and how the mathematical notation must be interpreted. Finally, a summary is given with the conditions under which we work.

    Contents 22

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    23

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields

    A Brief Guide for Students

  • Randomness

    Quantity of interest → variable of interest.

    Deterministic: total knowledge about the process generating the values.

    Random or stochastic: partial knowledge (values and probabilities) about the process generating the values.

    Probability distributions (for the variable) are used to explain the real relation between values and probabilities (quantity of interest).

    Statistics exploits data (some values and their frequencies) in order to select and study a probability distribution model (values and their probabilities) that explains the process of interest.

    This is what we will do, both theoretically and in practice.

    Basic Concepts 24

  • We are frequently interested in a characteristic – variable – of the elements of a group – population –, or we may have interest in how a property behaves in two populations or more. When population variables are supposed to be stochastic, Probability Theory provides a framework to try to explain them. The most important quantities are the measures

    μ = E(X)  and  σ² = Var(X) = E([X − μ]²)

    If any (well-known) parametric probability distribution is used as a model to explain the variable X (model-based approach), we are interested in studying its parameters (e.g. p, λ, μ, σ) or the entire distribution, that is, F_X(x). There is a relation between the measures μ and σ and the parameters (for the normal distribution, μ and σ are directly used as parameters in the density function).

    Basic Concepts 25

  • Studying each element of the population is too time-consuming, expensive or even impossible (the population is infinite, or it is necessary to break or spoil the elements).

    Fortunately, it is possible to consider a sample of elements – usually few with respect to the size of the population – by applying proper sampling techniques, so as to guarantee that the sample is representative and that we will therefore succeed in inferring population information. Additionally, considering a sample can reduce some types of error.

    There are statistical techniques designed to describe the most important characteristics of both models and data (Descriptive Statistics), to select the model that best suits some data or to approximate the parameters of a particular model (Inferential Statistics), or, once a good enough model has been found, to predict or forecast future values (Predictive Statistics).

    When the main statistical process involves distributions with parameters θ, we talk about Parametric Statistics; otherwise, about Nonparametric (or Distribution-Free) Statistics.

    Basic Concepts 26

  • Real World → Probability Model:

    1 coin → X ~ B(p)
    n independent coins → X ~ Bin(n, p)
    1 die → X ~ UnifDisc(6)
    Economic problem → X ~ F with f(x)

    Statistics: methods that use the sample X = {X1,...,Xn}

    (1) to study μ = E(X)
    (2) to study σ² = Var(X)
    (3) to study the parameter(s) θ (and hence possibly μ and σ² too)
    (4) to study a characteristic (e.g. the median of X)
    (5) to study the whole F_X

    Relation between μ, σ and θ

    For most distributions, θ appears in the expressions of μ and σ:

    μ = E(X) = ∫ x f(x) dx        σ² = E([X − E(X)]²) = ∫ (x − μ)² f(x) dx

    (For discrete variables, sums instead of integrals.) Examples:

    Bernoulli: θ = p, and then μ = p and σ² = p(1 − p)
    Poisson: θ = λ, and then μ = λ and σ² = λ
    Normal: θ = (μ, σ), and then the mean is μ and the variance is σ²

    Thus, when we estimate θ we obtain natural (plug-in) point estimates of μ and σ.

    Basic Concepts 27
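The plug-in idea above can be sketched numerically (an illustration with made-up data, not from the slides): for a Bernoulli variable, estimating the parameter p by the sample proportion immediately yields estimates of μ = p and σ² = p(1 − p).

```python
# Illustrative sketch with hypothetical yes/no data: X ~ Bernoulli(p),
# so mu = p and sigma^2 = p(1 - p).  Estimating p gives plug-in
# estimates of both population measures.
data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # made-up sample of size 10

p_hat = sum(data) / len(data)      # estimate of the parameter p
mu_hat = p_hat                     # plug-in estimate of mu = p
var_hat = p_hat * (1 - p_hat)      # plug-in estimate of sigma^2 = p(1 - p)
print(p_hat, var_hat)
```

The same plug-in reasoning applies to the Poisson and normal examples above.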

  • The Statistical World

    Theory: theoretical population – the variable of interest X, with distribution F(x; θ), density f(x; θ) and measures μ = E(X) and σ² = Var(X); theoretical sample (subset) X = {X1, ..., Xn}.

    Practice: empirical population; empirical sample (subset) – elements of the sample and the variable – with data x = {x1, ..., xn}.

    Inferential process: tools of Probability Theory used for point estimations, confidence intervals, hypothesis tests, other types of problem, and formulas.

    Basic Concepts 28

  • The Statistical World

    Deduction, induction (inference).

    Theory: theoretical population – random variable X with probability function f(x; θ) (for X continuous) and expected histogram; parameters and main measures μ = E(X) and σ² = Var(X); theoretical sample (subset) X = {X1, X2, ..., Xn}; estimator and statistic T(X), e.g. the sample mean (1/N) Σ Xi.

    Practice: empirical population; empirical sample (subset) x = {x1, x2, ..., xn} – possible values of the random variable – and empirical histogram; evaluations of the estimator and the statistic T(x), e.g. the sample mean (1/N) Σ xi.

    Basic Concepts 29

  • Basic Concepts 30

    Randomness

    Having partial knowledge and using only some elements of the population implies, on the one hand, that variables must be assigned a random character (we can only hypothesize about the other elements) and, on the other hand, that the results will have no total certainty, in the sense that statements will be set with some probability. For example: a 95% confidence in applying a method must be interpreted as any other probability: the results are true with probability 0.95 and false with probability 1 − 0.95 = 0.05 (frequently, we will never know whether the method has failed or not).
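This frequency interpretation of 95% confidence can be checked by simulation (an illustrative sketch, not from the slides; all numbers are arbitrary choices): build many 95% confidence intervals from independent samples and count how often they contain the true value.

```python
import random

# Repeatedly construct a 95% confidence interval for the mean of a normal
# population with known sigma, and count how often it covers the true mean.
random.seed(0)
mu, sigma, n, trials = 10.0, 2.0, 30, 2000
z = 1.96  # standard normal quantile for 95% confidence
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    half = z * sigma / n ** 0.5
    if x_bar - half <= mu <= x_bar + half:
        covered += 1

coverage = covered / trials
print(coverage)  # should be close to 0.95
```

About 5% of the intervals miss the true mean, and in practice we never know whether ours is one of them.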

    Units of Measurement

    In Probability Theory, random variables are dimensionless quantities; in real-life problems, variables almost always are not. Since this fact does not usually cause trouble in Statistics, we do not pay much attention to the units of measurement; we can understand that the magnitude of the real-life variable, with no unit of measurement, is the part being modeled with the proper probability distribution and the proper parameter values (of course, units of measurement are not random). To get used to paying attention to the units of measurement and to managing them, they are written in many numerical expressions.

  • Naranjito Has a Question

    Yes. In fact, I have The Question:

    What can these things be used for?

    Basic Concepts 31

  • The Roper Organization conducted a poll in 1992 (Roper, 1992) in which one of the questions asked was whether or not the respondent had ever seen a ghost. Of the 1525 people in the 18 to 29-year-old age group, 212 said yes.

    a. What is the risk of someone in this age group seeing a ghost?
    b. What is the approximate margin of error that accompanies the proportion in (a)?
    c. What is the interval that is 95% certain to contain the actual proportion of people in this age group who have seen a ghost?

    (From: Mind on Statistics. Utts, J.M., and R.F. Heckard. Thomson.)

    The U.S. Senate has 100 members. Information was obtained from the individuals responsible for managing correspondence in 61 senators' offices. Of these, 38 specified a minimum number of letters that must be received on an issue before a form letter in response is created.

    a) Assume these observations constitute a random sample from the population, and find a 90% confidence interval for the proportion of all senators' offices with this policy.
    b) In fact, information was not obtained from a random sample of senate offices. Questionnaires were sent to all 100 offices, but only 61 responded. How does this information influence your view of the answer to part (a)?

    (From: Statistics for Business and Economics. Newbold, P., W. Carlson and B. Thorne. Pearson-Prentice Hall.)
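    As a first numerical contact with these two exercises, the normal-approximation (Wald) interval for a proportion can be computed directly. The following Python sketch is not part of the original notes; the function name and the rounded normal quantiles 1.96 and 1.645 are choices of this sketch:

```python
import math

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    z = {0.90: 1.645, 0.95: 1.96}[confidence]    # rounded normal quantiles
    p_hat = successes / n                        # sample proportion
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Ghost poll: 212 "yes" answers out of 1525 people aged 18 to 29
p, m, (lo, hi) = proportion_ci(212, 1525, confidence=0.95)

# Senate offices: 38 out of 61 offices with the policy, 90% confidence
p2, m2, (lo2, hi2) = proportion_ci(38, 61, confidence=0.90)
```

    Here p ≈ 0.139 plays the role of the "risk", m is the margin of error and (lo, hi) the 95% interval; for the second exercise the 90% interval is roughly (0.52, 0.73).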

    Basic Concepts – Real-World Problems

    32

  • Basic Concepts – Real-World Problems

    33

    Well, let's think a little...

    [Population] There is a huge population of 18 to 29-year-old students (it cannot be determined in more detail from the information in the statement).

    [Model] The answer can be no or yes, so we can model any student of the population through a Bernoulli variable X whose parameter would be the probability for any student to answer yes (this is a model-based approach).

    [Sample] A sample of n = 1525 students was gathered (by applying simple random sampling with or without replacement) and the data are (x1,...,x1525), where 212 answers are yes and the others are no.

    [Strategy] We need an estimator of the parameter to use the data. That "risk" they talk about is the probability mentioned, that is, the parameter of the Bernoulli model. We must look up the definition of margin of error and how to calculate it. Finally, we must learn a method to build confidence intervals. On the other hand, we will work with random variables, (X1,...,Xn), while doing the calculations, until the statistical values (x1,...,xn) are finally substituted in the theoretical expressions.

    We must interpret the probability, the margin of error and the confidence interval (in Statistics, there is always a measure of error).

    Thoughts

  • Statistics and Samples

    Population
    Variable: presence of the policy (in a senator's office)
    Interest: distribution, mean and variance of the variable
    Character: since we cannot control all the details of the underlying process, the variable is treated as a random variable

    Probability
    Random variable X ≡ presence, μ = E(X), σ² = Var(X)
    Which distribution explains X best?

    Sampling
    X = {X1,...,Xn}, sample of elements
    Methods: how should the elements of the sample be selected? How many of them?

    Tools
    Concretely, what do we want to study about the variable "presence of the policy"?
    How do we use the sample? How trustworthy will our conclusions be?

    Basic Concepts – Real-World Problem

    34

  • The path from a real-world question to a real-world answer can be represented as a chain of five arrows:

    Real-World Question → (1) → Subject (Probability Theory) → (2) → Mathematical Formulas → (3) → Numerical Results → (4) → Statistical Interpretation → (5) → Real-World Answer

    Quantity partially known: values and frequencies, x = {x1,...,xn}
    Random variable: values and probabilities, f(x;θ), X = {X1,...,Xn}

    In some exercises we go through arrows 2 to 4 only; in others, through arrows 1 to 5.

    Basic Concepts 35

  • Population: set of elements in which we are interested. Examples: (1) All possible clients in a region. (2) The light bulbs of a batch. (3) All five-year-old children of the world.

    Parameter: fixed quantity (not a variable) that appears in the expression of the functions of the random variables.

    Sample: subset of the population that is considered. Examples: (1) Some potential clients randomly selected for interview. (2) The first six light bulbs of a batch of one thousand. (3) The five-year-old children living near certain interviewers.

    Sampling: process to properly select the elements of the sample from the elements of the population.

    Note: When working with two populations, we assume that they are independent, meaning that their values do not influence each other (models X and Y are independent). This independence between populations is different from the independence within samples (the Xi independent and, on the other hand, the Yi independent). Sometimes data are paired to reduce the effect (variability) of a factor (e.g. the person who operates a machine); these paired populations need special statistical methods.

    Basic Concepts 36

  • Simple: each element is selected independently and with the same probability. E.g.: Inhabitants are selected with the same probability and independently from the whole country.

    Cluster: elements are previously grouped into subsets (clusters) as similar as possible among them and to the population. E.g.: Inhabitants are selected with the same probability and independently from some cities representative of the country.

    Stratified: elements are previously grouped, by using a characteristic or factor, into subsets (strata) as different as possible among them. E.g.: To analyse the possible effect of the city size, inhabitants are selected with the same probability and independently from some cities of quite different size.

    Basic Types of Sampling

    Simple Random Sampling. The main theoretical implications for us are the following:

    ...

    We work under this type of sampling. This implies that the random variables of the sample will be independent copies of the model:

    F(x1, x2,..., xn) = Π_{j=1}^n F_{Xj}(xj)        f(x1, x2,..., xn) = Π_{j=1}^n f_{Xj}(xj)

    General formulas:

    E(X1 ± X2) = E(X1) ± E(X2)
    Var(X1 ± X2) = Var(X1) + Var(X2) ± 2·cov(X1, X2)
    E(Σ_{j=1}^k Xj) = Σ_{j=1}^k E(Xj)
    Var(Σ_{j=1}^k Xj) = Σ_{j=1}^k Var(Xj) + Σ_{i≠j} cov(Xi, Xj)

    Under independence:

    Var(X1 ± X2) = Var(X1) + Var(X2)
    Var(Σ_{j=1}^k Xj) = Σ_{j=1}^k Var(Xj)

    Note: Applying the appropriate sampling allows saving money: e.g. by reducing the travels in the cluster sampling, or by quickly attaining the necessary sample sizes in the stratified sampling. Additionally, not all the elements of the population can always be accessed. On the other hand, it does not matter whether the sampling is applied with or without replacement, since we assume that the sample size n is much smaller than the population size.

    Basic Concepts
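    These formulas can be verified exactly on a toy example by enumerating all outcomes. A minimal Python sketch (the two independent fair dice are an illustrative assumption of this sketch, not of the notes):

```python
from itertools import product

def mean(vals):
    return sum(vals) / len(vals)

def var(vals):
    m = mean(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

# All 36 equally likely outcomes of two independent fair dice
sums = [x1 + x2 for x1, x2 in product(range(1, 7), repeat=2)]

v_sum = var(sums)                    # Var(X1 + X2), computed exactly
v_single = var(list(range(1, 7)))    # Var of one die: 35/12

# Under independence the covariance term vanishes:
# Var(X1 + X2) = Var(X1) + Var(X2)
```

    The exact enumeration gives Var(X1 + X2) = 2 · (35/12), with no covariance term.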

  • Note: If the sample is not representative of the population, the inferential process will fail. Thus, paying attention to the sampling process applied must be the first step in reading and interpreting any statistical analysis.

    A small population and three possible samples

    Taken from: R Code Applied to Statistics. David Casado de Lucas. http://www.Casado-D.org/edu/CodeAppliedToStatistics-Slides.pdf

    Appropriate sampling: the sample does represent the population

    (Trustworthy results.)

    Inappropriate sampling: the sample does not represent the population(Untrustworthy results.)

    Basic Concepts 38


  • Probability Function and Histograms

    X random variable; the classes Ci partition the range of X.

    Probability function (mass function if X is discrete, density function if X is continuous):
    pi = P(Ci), the probability for X to take a value in the i-th class Ci.

    Expected histogram (theoretical sample X1,...,Xn):
    ei = n·pi, the expected absolute frequency of the i-th class Ci (think about the expectation of the binomial variable counting the number of trials inside Ci); fi = ei/n is the relative version.

    Empirical histogram (empirical sample x1,...,xn):
    Ni, the empirical absolute frequency of the i-th class Ci (or the proportion of values inside Ci, for the relative version).

    Histograms can be built by using either absolute or relative frequencies:

    f(x)        fn(x) = #{Xi ∈ Cx}/n        fn(x) = #{xi ∈ Cx}/n

    By the laws of large numbers, the histograms tend to the probability function, which justifies the use of samples to infer population information.

    39

    Basic Concepts

  • Distribution Function

    X random variable.

    Distribution function: F(x) = P(X ≤ x).

    Expected sample distribution function (theoretical sample X1,...,Xn):
    Fn(x) = #{Xi ≤ x}/n.

    Empirical sample distribution function (empirical sample x1,...,xn):
    Fn(x) = #{xi ≤ x}/n.

    By the laws of large numbers, the sample distribution functions Fn(x) tend to the distribution function, which justifies the use of samples to infer population information (in fact, the Glivenko-Cantelli theorem proves that the convergence is uniform).

    Basic Concepts 40
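    The uniform convergence stated by the Glivenko-Cantelli theorem can be observed numerically. A Python sketch (the Uniform(0,1) population, the sample sizes and the fixed seed are choices of this sketch):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def ecdf_sup_distance(sample):
    """Sup distance between the empirical CDF of a Uniform(0,1) sample
    and the true CDF F(x) = x (the sup is attained at the jump points)."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(i / n - x, x - (i - 1) / n)
               for i, x in enumerate(xs, start=1))

d_small = ecdf_sup_distance([random.random() for _ in range(50)])
d_large = ecdf_sup_distance([random.random() for _ in range(5000)])
# d_large should be much smaller than d_small
```

    Increasing the sample size from 50 to 5000 shrinks the sup distance max_x |Fn(x) − F(x)| by roughly an order of magnitude.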

  • Use of the Samples

    Mathematically, it is possible to consider all possible samples, even if the number of them is infinite. It is usually difficult to cope with this task directly, but for most situations we consider additional, indirect results. For example,

    E(X̄) = μ = E(X)

    can be interpreted as follows: if all possible samples and their probabilities were considered, the sample mean were evaluated at them, and all these quantities were substituted into the expression of the expectation, the final value would be the same as the expectation of X.

    Complete sampling: a representation of all possible samples of size n (the number of them can be infinite, or they may not be totally ordered, in fact):

    X(1) = {X1(1), X2(1),..., Xn(1)}
    X(2) = {X1(2), X2(2),..., Xn(2)}
    ...
    X(m) = {X1(m), X2(m),..., Xn(m)}

    Each time we use a sample quantity Q, one sample is considered to obtain one value:

    q = Q(X1, X2,..., Xn)

    Theoretically (complete sampling), the population mean of Q is

    μQ = E(Q) = Σ qj·fQ(qj) = ∫ q·fQ(q) dq        (discrete / continuous)

    In practice (partial, real sampling), to estimate the population mean of Q we consider m samples and the sample mean of Q:

    q̄ = (1/m) Σ_{j=1}^m qj

    Basic Concepts 41
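    The interpretation of E(X̄) = μ can be checked exactly for a tiny population, where "all possible samples" can really be enumerated. A Python sketch (the population values are an arbitrary illustration):

```python
from itertools import product

population = [1, 3, 7, 9]               # a tiny population, chosen for illustration
mu = sum(population) / len(population)  # population mean: 5.0

n = 2
# All equally likely samples of size n (with replacement: independent copies)
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

# Averaging the sample mean over ALL samples recovers mu exactly: E(X bar) = mu
grand_mean = sum(sample_means) / len(sample_means)
```

    The average of the 16 sample means equals the population mean exactly, which is the complete-sampling reading of E(X̄) = μ.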

  • Use of the Samples

    Complete sampling gives the joint probability distribution of X and, through Q, the sampling probability distribution of Q:

    X(1) = {X1(1),..., Xn(1)}   with probability P(X(1))   →   Q(X(1))
    X(2) = {X1(2),..., Xn(2)}   with probability P(X(2))   →   Q(X(2))
    ...
    X(m) = {X1(m),..., Xn(m)}   with probability P(X(m))   →   Q(X(m))

    Random samplings (not others, e.g. those based on experts) are usually the only ones guaranteeing that the sample is representative. Apart from this fact, let us think of a particular representative sample, say {x1,...,xn}, and two practitioners.

    (1) A practitioner using mathematics but not inference theory can enumerate (by extension or by comprehension) all possible samples and hence all possible values for Q, from which some posterior calculations and representations can be done: sample mean, sample variance, sample median, histogram, et cetera.

    (2) Another practitioner using both mathematics and inference theory can enumerate (by extension or by comprehension) all possible samples and their probabilities, and hence all possible values for Q with their chances, that is, the sampling distribution of Q. This probability distribution plays the role of a system of reference where values can be compared statistically, which allows this practitioner to quantify the statistical statements, to compare what has happened with what could or should have happened (sampling error), to study what can or will happen, et cetera. (This second practitioner can also do, obviously, what the first practitioner can. Both could study the true error if the whole population were also studied.)

    Basic Concepts 42

  • Asymptoticity

    Some concepts are studied for a finite value of the sample size n, while others are studied in the limit n → ∞.

    Finite-sample-size framework. Some concepts: unbiasedness, efficiency, etc.

    Asymptotic framework. Some concepts: asymptotic unbiasedness, consistency, etc.

    There is not a severe change of behaviour at any value of n, although in practice we consider as asymptotic those sample sizes larger than 30 (or 25).

    In cases where only few data will be available, asymptotic concepts make no sense, while these concepts are the only important ones when many data will always be involved.

    Note: Although more data will usually imply better information, this is not true if the data do not have a minimum quality. This is a problem we do not face here, but it may appear in real statistical analyses.

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    44

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields
    A Brief Guide for Students

  • Data are used as input in statistics and estimators, which will be used in the statistical methods that allow us to study the population characteristics of interest: mean, variance, parameters, et cetera.

    We can talk about three main kinds of problem:

    Point estimations: By using a proper estimator, the value of μ, σ or θ has to be estimated. Well-known estimators and general methods of building estimators are introduced. The fulfilment of some properties of the sampling distribution of the estimators is studied.

    Confidence intervals: By using a proper statistic, an interval of values (instead of only one value) has to be obtained in such a way that we have a minimum certainty that the unknown true value is inside the interval.

    Hypothesis tests: By using a proper statistic, a decision about the value of a measure or a parameter (parametric problem) is made by applying testing methodologies. Additionally, other statistics to make decisions about more general questions (nonparametric problems) are introduced. All these statistics evaluate possible discrepancies between the sample information and the population information expected under theoretical conditions.

    Note: We (and many authors) represent the population quantities through Greek letters: θ (theta), μ (mu), σ (sigma), λ (lambda), κ (kappa), η (eta), etc. Latin letters or accents are used for sample quantities: S, x̄, etc.

    Types of Problem 45

  • Subject

    Probability Theory + Hypotheses → Statistical Inference → Results → Interpretation

    Descriptive Statistics

    Inference Theory
    Point Estimations: methods (moments and maximum likelihood); properties (unbiasedness, consistency, etc.)
    Confidence Intervals: methods (pivotal quantity); minimum sample size
    Hypothesis Tests: methodologies (critical region and p-value); character (parametric and nonparametric)

    Applications
    Direct, to solve problems: X ≡ speed of a particle, X ≡ presence of an error, X ≡ gross domestic product; μ, σ, θ, f(x;θ), F(x;θ)
    Indirect, to develop new theory: estimation of the coefficients of a model; tests of the hypotheses of a model; tests for the diagnosis of a model

    Metasubject
    Thinking; listening, reading, writing and speaking; teaching and learning; exploitation

    Types of Problem 46

  • Statistical Problem: study a variable of a population by using samples.
    question → random variable → probability distribution (Probability Theory, Statistical Inference)

    Sampling: appropriate process for the sample to be representative of the population.
    Types of sampling, simple random sampling... (Inference Theory)

    Statistic: function that uses the sample information in a proper way.
    Sample mean, sample variance... sampling probability distribution... (Probability Theory)

    Inferential method: technique to answer the statistical question and solve the problem.
    A unique value that estimates a measure (μ, σ², etc.) or a parameter (θ, λ, κ, η, etc.): estimators (X̄, S², etc.) and statistics, methods (maximum likelihood, moments), properties (MSE, etc.)... (Point Estimations)
    A set of values and the probability for an unknown measure or parameter to be inside: estimators and statistics, methods (pivotal quantity)... (Confidence Intervals)
    A decision to choose between two hypotheses about a measure, a parameter, a characteristic of F (probability of an event, median, symmetry, etc.) or the entire distribution F, made with a bounded probability (the significance α) of rejecting the null hypothesis when it is true: estimators and statistics, types of error, p-value, power function, methodologies... (Hypothesis Tests)

    Real-World Problem: study a characteristic of a group.
    weight ~ tree of a forest (Biology); speed ~ particle of a gas (Physics); benefit ~ industrial sector (Economics)
    (In some cases, more than one variable or population are considered.)

    Do not confuse the population distributions with the sampling distributions of estimators and statistics.

    Types of Problem 47

  • Point Estimations, Confidence Intervals and Hypothesis Tests

    X = {X1,...,XnX} and Y = {Y1,...,YnY}, simple random samples

    Real-World Problem: variables of interest, processes, means and variabilities...
    Populations: X and Y, functions FX(x;θ), fX(x;θ), FY(y;θ) and fY(y;θ), measures μX, μY, σX, σY,...

    Although we may know the sampling distribution of the estimators in some cases (e.g. X̄ for normal populations), in general we have to use statistics involving them such that: (1) a theorem tells us the (asymptotic) sampling distribution necessary to calculate probabilities and quantiles; (2) they are dimensionless, so they do not depend on the scale in which the data are measured.

    To study measures (μ, σ², etc.) or parameters (θ, λ, κ, η, etc.):

    To study the means μX and μY, e.g. (X̄ − μ)·√n / S

    To study the variances σX² and σY², e.g. (n − 1)·S² / σ²

    To study the parameters θ (they also allow studying the means and the variances)

    To study characteristics or whole probability distributions:

    Σ_{i=1}^K (Ni − ei)² / ei        max_x |Fn(x) − F(x;θ)|

    Types of Problem 48

  • Statistical Questions

    49

    In 1990, 25% of births were by mothers of more than 30 years of age. This year a simple random sample of 120 births has been taken, yielding the result that 34 of them were by mothers of over 30 years of age.

    a) With a significance of 10%, can it be accepted that the proportion of births by mothers of over 30 years of age is still 25%, against that it has increased? Select the statistic, write the critical region and make a decision. Calculate the p-value and make a decision. If the critical region is Rc = {sample proportion > 0.30}, calculate β (the probability of the type II error) for an alternative proportion equal to 0.35. Plot the power function with the help of a computer.

    b) Obtain a 90% confidence interval for the proportion. Use it to make a decision about the value of the proportion, which is equivalent to having applied a two-sided (nondirectional) hypothesis test in the first section.

    1 Nonparametric test to validate this assumption (the simple random sampling).
    2 Parametric test to evaluate these hypotheses about the value of the proportion.
    3 Confidence interval to bound the value of the proportion.

    Types of Problem
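    The calculations asked for in this exercise can be sketched numerically. The following Python snippet is not part of the notes; the helper phi is a standard-normal CDF built from the error function, and 1.645 is the rounded 95% normal quantile:

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, successes, p0 = 120, 34, 0.25
p_hat = successes / n                                     # about 0.283

# One-sided test, H0: proportion = 0.25 against H1: proportion > 0.25
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 1 - phi(z)

# 90% confidence interval for the proportion (normal approximation)
margin = 1.645 * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)
```

    Since the p-value (about 0.20) is larger than 0.10 and the value 0.25 lies inside the interval, the hypothesis that the proportion is still 25% is not rejected.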

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    50

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields
    A Brief Guide for Students

  • A statistic T is a mathematical function using the information contained in the sample X = {X1,...,Xn}, that is:

    T(X) = T(X1,...,Xn)

    Since the Xi are random variables, T is a random quantity too. Its distribution (possible values and their probabilities) is termed the sampling distribution. Sometimes this distribution is one of the well-known probability models; other times it is difficult to know, although we can always study it empirically (mean, variance, histogram, et cetera). For unknown values X = {X1,...,Xn}, a statistic T(X) is still a random quantity; for specific values x = {x1,...,xn}, the evaluation T(x) is no longer stochastic but a number.

    Estimators. An estimator is a statistic that is used to estimate the value of a quantity of interest (it cannot depend on it): any of the measures μ and σ, or a parameter θ. To guarantee the quality of an estimator, we define some concepts. The evaluation of an estimator of θ is termed an estimate of the parameter θ. The whole process is termed estimation.

    Pay attention to the notation we use: upper-case letters for random quantities, lower-case letters for their possible values (known or unknown).

    Statistics and Estimators – Statistics

    51

  • To study the quantities in which we are interested, μ, σ or θ, we need estimators of them. We introduce two general methods to build estimators of any parameter, and some well-known estimators of the mean and the variance.

    Once the quality of an estimator is guaranteed (by studying concepts and properties based on its mean and variance), we need to know its sampling distribution so as to be able to do calculations (quantiles and probabilities).

    In fact, instead of using the estimators themselves we frequently define statistics involving them such that:

    Their sampling distribution (exact or asymptotic) is known in theory

    They are dimensionless versions of the estimators

    The basic statistics are summarized in tables from which we will select the appropriate one for each situation. The underlying theorems will be mentioned.

    This will allow us to do the calculations and to evaluate the agreement between the sample information and the theoretical assumptions.

    That is why we need not care about the units of measurement of the data during the calculations (although we should care about them for a proper interpretation of the problem and the solution). On the other hand, the natural spread of the data is taken into account by these statistics.

    Statistics and Estimators – Statistics Made with Estimators

    52

  • For an estimator E. Given a value e of E, is it small or large? To answer this question we need a system of reference; the probability distribution of E plays this role. How many values ('how much', rather, for continuous distributions) are above it? Quantiles (median, quartiles, deciles and centiles) are referential values. Random variables are dimensionless.

    For any random quantity, say E. The distribution of E must be known to judge a value e. Nevertheless, we do not always know this distribution, but rather the distribution of a quantity involving E, say T. This is enough to judge any value e of E, since it is possible to judge the transformed value te within the distribution of T. This quantity T is dimensionless, which makes it even more useful.

    Example. In an exam, the average score of the class has been 6.7 points. How good is this score? Answer: It must or should be compared (e.g. using a figure) with the distribution of the variable "average score" for any class taking that exam. (A variable can also be defined as the average of other quantities.)

    (Analogous figures can be created for the discrete case: densities fE(e) and fT(t), with the values e and t marked on their axes.)

    Statistics and Estimators – Referential Values of a Sampling Distribution

    53

  • Both statistics and estimators are unidimensional random quantities. The mean and the variance of their sampling distributions should be analysed to study how these quantities behave, as if we were going to use them many times with different samples, even if in practice they are to be used only once.

    Let Q be a statistic or an estimator; theoretically, these two measures are

    μQ = Σ qj·fQ(qj)   (discrete)        μQ = ∫ q·fQ(q) dq   (continuous)

    σQ² = Σ (qj − μQ)²·fQ(qj)   (discrete)        σQ² = ∫ (q − μQ)²·fQ(q) dq   (continuous)

    Nevertheless, it is usually difficult to know fQ. Instead, to try to find μQ and σQ², we will apply the basic general properties of the measures E(·) and Var(·).

    Note: We are interested in the sampling distribution of the univariate quantity Q, not in the joint distribution of the random vector (X1,...,Xn).

    Statistics and Estimators – Sampling Distribution

    54

  • Concretely, we want to study how the values of Q are distributed with respect to the quantity under study, say θ.

    We want the average value E(Q) to be as close to θ as possible. Nevertheless, the average value E(Q) may be close, or even equal, to θ while all possible values of Q are far from it: variability measures how close to the average value E(Q), and hence among them, the possible values qj are. What if the only time we are going to use Q it takes a value q4, far from θ? Thus, it is necessary to study the behaviour, or quality, of the estimators.

    Standard Errors (Absolute and Relative)

    The quantity σQ is termed the sampling or standard (absolute) error of Q; the dimensionless quantity σQ/|μQ| is termed the sampling or standard (relative) error of Q. For example, when Q = X̄ is used to estimate μ = μX, it takes the form

    σQ = σX̄ = √(σX̄²) = √(σX²/n) = σX/√n ≈ S/√n

    As the sampling distribution fQ is usually unknown, σQ is calculated or estimated by using the last approximation.

    If the standard error has the same order of magnitude as q, or higher, the estimate is not trustworthy.

    Statistics and Estimators – Sampling Distribution

    55
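    In practice, the last approximation is computed from the data. A minimal Python sketch (the data values are invented for illustration):

```python
import math

data = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.2, 12.4]  # invented data
n = len(data)
x_bar = sum(data) / n

# Sample quasivariance S^2 (divisor n - 1) and the estimated
# standard (absolute) error S / sqrt(n)
S2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
std_error = math.sqrt(S2 / n)

# Dimensionless (relative) version of the standard error
rel_error = std_error / abs(x_bar)
```

    Here the relative error is well below 1% of the estimate, so the estimate would be considered trustworthy in the sense of the slide.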

  • Let Q(X1,...,Xn) be a sample quantity. We are interested in the behaviour of Q, that is, in its possible values and their probabilities. This is usually difficult to obtain. Some details can sometimes be seen by using measures and figures instead of the values themselves (as we do in Descriptive Statistics). Possible properties are:

    Probability p of some events involving Q, or quantile c determining the set of the smallest or biggest values that Q can take with certain probability, for example p = P(Q ≤ c)

    Mean E(Q), variance Var(Q), moments, etc.

    Bias: b(Q) = E(Q) − θ

    Mean square error: MSE(Q) = b(Q)² + Var(Q)

    Sufficiency: Q contains the same information to estimate the parameter as the whole sample

    Asymptotic behaviour: asymptotic bias, consistency

    On the other hand, comparison of estimators is of great interest:

    Relative efficiency: comparison of the mean square errors of two estimators

    Efficiency: unbiasedness plus minimum variance, that is, b(Q) = 0 and Var(Q) minimum once b(Q) = 0

    Asymptotic behaviour: asymptotic (relative) efficiency

    Statistics and Estimators – Sampling Distribution

    56
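    The decomposition MSE(Q) = b(Q)² + Var(Q) can be checked exactly by complete enumeration, in the spirit of the earlier "complete sampling" slides. A Python sketch (the tiny population and the choice of estimator, the variance of the sample with divisor n, are illustrative assumptions):

```python
from itertools import product

population = [0, 2, 4]                                   # tiny illustrative population
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N      # true variance: 8/3

n = 2
samples = list(product(population, repeat=n))            # all equally likely samples

def s2(sample):
    """Variance of the sample (divisor n), as estimator Q of sigma^2."""
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / n

values = [s2(s) for s in samples]
e_q = sum(values) / len(values)                          # E(Q)
var_q = sum((v - e_q) ** 2 for v in values) / len(values)
bias = e_q - sigma2                                      # b(Q) = E(Q) - sigma^2
mse = sum((v - sigma2) ** 2 for v in values) / len(values)
# Check: mse equals bias**2 + var_q
```

    The enumeration also exhibits the bias of this estimator: E(Q) = σ²·(n − 1)/n, so b(Q) = −σ²/n here.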

  • Population Quantities and Sample Quantities

    Mean μ (or average)       ↔  Sample mean
    Proportion                ↔  Sample proportion
    Variance σ²               ↔  Variance of the sample; sample variance; sample quasivariance
    Standard deviation σ      ↔  Standard deviation of the sample; sample standard deviation; sample quasi-standard deviation
    Parameter(s) θ            ↔  Estimator of θ

    For two populations, similar quantities can be written.

    Statistics and Estimators 57

  • Random Quantities and Nonrandom Quantities

    Random variable X  ↔  value of the random variable (a specific one), x

    Sample (any) X = {X1, X2,..., Xn}  ↔  sample (a specific one) x = {x1, x2,..., xn}

    Population quantities: two important measures of (the model of) the variable X, μ(X) and σ²(X).

    Sample quantities: three important statistics to study the two important measures of the variable X, and the estimator of θ  ↔  the values of the three random statistics when they are evaluated at a specific sample x, and the estimate of θ.

    Two important measures of any important statistic (the sample mean, now) are used to study the two important measures of the variable X.

    For two populations, similar quantities can be written.

    Statistics and Estimators 58

  • Main Theorems

    Laws of large numbers. Intuitively, sample relative frequencies tend to the population probabilities, which justifies why Statistics works: the empirical histogram tends to the population histogram and both tend to the population probability distribution.

    Linear combinations of normal variables. When a normally distributed variable is added to, subtracted from, multiplied by or divided by a quantity, we can know the normal distribution of the result. As a particular case, the probability distributions of the total sum and the sample mean are known.

    Central limit theorems. For any population probability distribution, these theorems allow us to know the asymptotic probability distribution of the total sum and the sample mean.

    Fisher's theorem (nonlinear combinations of normal variables). For normally distributed population variables, this theorem allows us to have a result involving both the population variance and an estimator of it.

    Others.

    Statistics and Estimators. Probability Theory provides results to compare population and sample information, and hence to support Statistics.

    59
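    The central limit theorems can be illustrated by simulation. A Python sketch (the Bernoulli(0.2) population, the sample size 200, the number of repetitions and the fixed seed are choices of this sketch, not of the notes):

```python
import math
import random

random.seed(1)  # fixed seed for reproducibility

def draw():
    """One value from a skewed, clearly non-normal population: Bernoulli(0.2)."""
    return 1.0 if random.random() < 0.2 else 0.0

n, reps = 200, 2000
means = [sum(draw() for _ in range(n)) / n for _ in range(reps)]

# Standardise with mu = 0.2 and sigma^2 = 0.2 * 0.8 = 0.16
z = [(m - 0.2) * math.sqrt(n) / math.sqrt(0.16) for m in means]

# By the CLT, about 95% of the standardised sample means fall in (-1.96, 1.96)
inside = sum(-1.96 < v < 1.96 for v in z) / reps
```

    Even though the population is far from normal, the standardised sample means already behave almost normally at n = 200.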

  • We are frequently interested in studying μ = E(X) and σ² = Var(X) (or, more ambitiously, the parameters θ of the entire FX(x;θ)) for one or more populations. We learn ways of finding estimators and evaluating their quality.

    Statistics are made with estimators by applying well-known mathematical results (theorems); now we do not see the results but merely tabulate the statistics. Since any variable of a simple random sample follows the same distribution as X, and the sample is used through statistics T, the measures μ and σ² of X appear also in the expressions of E(T) and Var(T).

    Statistics for nonparametric methods (to study characteristics of the population distribution different from the mean or the variance) are also tabulated.

    Example: X̄ is an estimator of μ.

    Inferential tool: statistic with which the sample information is used in a proper way to answer the statistical question. A theoretical result (theorem) tells us its distribution, which we use to calculate probabilities or find quantiles.

    Sample information: number of populations (1 or 2); type of population: normal (any n), any (big n) or Bernoulli (big n).

    Population information: parameter on which the statistical question is based; knowledge about the other parameter of the distribution.

    These criteria organize the Tables of Statistics.

    Statistics and Estimators – Motivation of the Statistics

    60

  • Let us apply a mathematical zoom to some statistics:

    X

    S2nParameter: population mean (n1)S2

    2

    Estimator: sample mean

    Dissimilarity: a comparison, based on a difference, between what the data say and the population value

    Variability: this denominator is a measure of the order of magnitude of the spread of the data, to have a reference with which the dissimilarity is measured. It makes the quotient a dimensionless quantity.

    Dissimilarity: a comparison, based on a quotient, between what the data say and the population value

    Estimator: sample quasivariance

    Parameter: population variance

    Cautions: Even if dimensionless quantities are necessary not to depend on the units of measurement, it is also necessary to look at the different terms in the expression of statistics. Otherwise, too small or large values of a term can be hightlighted or hidden by other terms. We will insist on this fact several times.

    Statistics and Estimators 61
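As a sketch of how the two statistics above are computed from data (Python is used only for illustration; the simulated values of μ, σ and n are assumptions, not part of the notes):

```python
import math
import random

random.seed(0)

# Hypothetical simulated sample: here mu and sigma are known because we simulate.
mu, sigma = 10.0, 2.0
n = 50
x = [random.gauss(mu, sigma) for _ in range(n)]

x_bar = sum(x) / n                                   # estimator of mu
S2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)    # sample quasivariance

# Dissimilarity (a difference) over variability (order of magnitude of spread):
T_mean = (x_bar - mu) / math.sqrt(S2 / n)            # t-type statistic

# Dissimilarity as a quotient between data and population value:
T_var = (n - 1) * S2 / sigma ** 2                    # chi-square-type statistic

print(T_mean, T_var)
```

Note that the first statistic is dimensionless, while the second compares the data-based quasivariance with the population variance through a quotient.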

  • Taken from: Solved Exercises and Problems of Statistical Inference. David Casado. http://www.Casado-D.org/edu/ExercisesProblemsStatisticalInference.pdf

    Basic Measures

    Basic Estimators

    Statistics and Estimators 62


  • Basic Quantities and Estimators

    Statistics and Estimators 63

  • Basic Statistics

    Equivalent formulas (S² is the sample quasivariance, with denominator n − 1; s² is the sample variance, with denominator n):

    (X̄ − μ)/√(σ²/n) = (X̄ − μ)/(σ/√n) = √n (X̄ − μ)/σ

    (X̄ − μ)/√(S²/n) = (X̄ − μ)/(S/√n) = √n (X̄ − μ)/S

    (X̄ − μ)/√(s²/(n − 1)) = (X̄ − μ)/(s/√(n − 1)) = √(n − 1) (X̄ − μ)/s

    Statistics and Estimators 64
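The equivalent standardizations of the sample mean can be checked numerically; a sketch (the data are hypothetical, and S² is assumed to denote the quasivariance with denominator n − 1, s² the sample variance with denominator n):

```python
import math

# Hypothetical data, just to check the algebraic equivalence
x = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9]
n = len(x)
mu = 5.0                                           # population mean, assumed for the check

x_bar = sum(x) / n
s2 = sum((xi - x_bar) ** 2 for xi in x) / n        # sample variance
S2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)  # sample quasivariance

a = (x_bar - mu) / math.sqrt(S2 / n)               # first form
b = (x_bar - mu) * math.sqrt(n) / math.sqrt(S2)    # second form
c = (x_bar - mu) / math.sqrt(s2 / (n - 1))         # same statistic via s2

print(a, b, c)   # the three forms coincide
```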

  • Basic Statistics

    Statistics and Estimators 65

  • Basic Statistics

    Statistics and Estimators 66

  • Basic Statistics

    [(X̄ − Ȳ) − (μ_X − μ_Y)] / √(S_p²/n_X + S_p²/n_Y)
    = [(X̄ − Ȳ) − (μ_X − μ_Y)] / √(S_p² (1/n_X + 1/n_Y))
    = [(X̄ − Ȳ) − (μ_X − μ_Y)] / √(S_p² (n_X + n_Y)/(n_X n_Y))

    Equivalent formula:

    [(X̄ − Ȳ) − (μ_X − μ_Y)] √(n_X n_Y/(n_X + n_Y)) / √((n_X s_X² + n_Y s_Y²)/(n_X + n_Y − 2))

    or

    [(X̄ − Ȳ) − (μ_X − μ_Y)] √(n_X n_Y/(n_X + n_Y)) / √(((n_X − 1)S_X² + (n_Y − 1)S_Y²)/(n_X + n_Y − 2))

    Statistics and Estimators 67
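The equivalence of the pooled two-sample formulas can also be checked numerically; a sketch with hypothetical samples (S² denotes quasivariances with denominator n − 1, s² sample variances with denominator n):

```python
import math

# Hypothetical two samples, to check the equivalent pooled formulas
x = [10.2, 9.8, 11.0, 10.5, 9.9]
y = [8.9, 9.4, 9.1, 10.0]
nx, ny = len(x), len(y)
mx, my = sum(x) / nx, sum(y) / ny
dmu = 0.0                                      # mu_X - mu_Y in the comparison

Sx2 = sum((v - mx) ** 2 for v in x) / (nx - 1)   # quasivariances
Sy2 = sum((v - my) ** 2 for v in y) / (ny - 1)
sx2 = sum((v - mx) ** 2 for v in x) / nx         # sample variances
sy2 = sum((v - my) ** 2 for v in y) / ny

Sp2 = ((nx - 1) * Sx2 + (ny - 1) * Sy2) / (nx + ny - 2)   # pooled quasivariance

a = ((mx - my) - dmu) / math.sqrt(Sp2 / nx + Sp2 / ny)
b = ((mx - my) - dmu) / math.sqrt(Sp2 * (1 / nx + 1 / ny))
c = ((mx - my) - dmu) * math.sqrt(nx * ny / (nx + ny)) \
    / math.sqrt((nx * sx2 + ny * sy2) / (nx + ny - 2))

print(a, b, c)   # the three forms coincide
```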

  • Basic Statistics

    Statistics and Estimators 68

  • Basic Statistics

    Statistics and Estimators 69

  • Basic Statistics

    Statistics and Estimators 70

  • Tests Based on the F Distribution and Analysis of Variance (ANOVA)

    Statistics and Estimators 71

  • Chi-Square Tests

    Statistics and Estimators 72

  • Kolmogorov-Smirnov Tests

    Statistics and Estimators 73

  • Other Tests

    Statistics and Estimators 74

  • Tables of Statistics

    Statistics and Estimators 75

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    76

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields
    A Brief Guide for Students

  • One Population and One Variable

    X, X_j

    X ~ F(x; θ), f(x; θ)

    μ = E(X), σ² = Var(X)

    Tools of Probability Theory used for: Point Estimations Confidence Intervals Hypothesis Tests Other types of problem

    Data

    Formulas

    X

    Inferential Process

    ...

    x, x_j

    Mathematical representation of the variable in which we are interested.

    Probability distribution to explain the behaviour of X (a model for it). There is at least one unknown quantity we want to study, unless we are in a simulated situation where an estimator is being studied; the other quantities can be known or unknown.

    Tables of statistics from which we select an appropriate T, taking into account the information about X and the sample size.

    Type of statistical question (an easy one, in this subject).

    Cases 77

  • Two Populations and One Variable

    Tools of Probability Theory used for: Point Estimations Confidence Intervals Hypothesis Tests Other types of problem

    Data

    Formulas

    Inferential Process

    μ₁ − μ₂

    σ₁²/σ₂²

    ...

    X, X_j

    X ~ F(x; θ_X), f(x; θ_X)

    μ_X = E(X), σ_X² = Var(X)

    ...

    Y, Y_j

    Y ~ F(y; θ_Y), f(y; θ_Y)

    μ_Y = E(Y), σ_Y² = Var(Y)

    ...

    Cases 78

    x, x_j

    y, y_j

  • One Population and Two Variables

    (X, Y), (X_j, Y_j)

    (X, Y) ~ F(x, y; θ), f(x, y; θ)

    α₁,₁ = E(XY)

    Tools of Probability Theory used for: Point Estimations Confidence Intervals Hypothesis Tests Other types of problem

    Data

    Formulas

    (X ,Y )

    Inferential Process

    ...

    (x, y), (x_j, y_j)

    Mathematical representation of the variables in which we are interested. We can also talk about one bivariate random variable.

    Probability distribution to explain the joint behaviour of (X, Y), and all the concepts around it: joint distribution and probability functions, bivariate moments, marginal distributions, conditional distributions...

    Tables of statistics from which we select an appropriate T, taking into account the information about X and the sample size.

    Type of statistical question (an easy one, in this subject). In these slides, we consider only the nonparametric hypothesis test of independence.

    α_{r₁,r₂} = E(X^{r₁} Y^{r₂})

    Cases 79

  • Several Populations and One Variable

    Cases 80

    Tools of Probability Theory used for: Point Estimations Confidence Intervals Hypothesis Tests Other types of problem

    Data

    Formulas

    Inferential Process

    Tables of statistics from which we select an appropriate T, taking into account the information about X and the sample size.

    Type of statistical question (an easy one, in this subject). In these slides, we consider only parametric hypothesis tests of the equality of means: Analysis of Variance (ANOVA).

    X^(1), X_j^(1);  X^(1) ~ F₁(x; θ₁)

    X^(2), X_j^(2);  X^(2) ~ F₂(x; θ₂)

    ...

    X^(P), X_j^(P);  X^(P) ~ F_P(x; θ_P)

    Data: x^(1), x_j^(1);  x^(2), x_j^(2);  ...;  x^(P), x_j^(P)

    Statistical question: μ_i = μ_j for i, j = 1, 2, ..., P?
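The ANOVA question of the equality of the P means can be sketched numerically; a minimal illustration (the P = 3 simulated samples and their parameters are hypothetical) of the F statistic that the tables tabulate:

```python
import random

random.seed(1)

# Hypothetical samples from P = 3 populations with equal means (H0 true)
samples = [[random.gauss(5.0, 1.0) for _ in range(20)] for _ in range(3)]

P = len(samples)
n = sum(len(s) for s in samples)
grand = sum(sum(s) for s in samples) / n

means = [sum(s) / len(s) for s in samples]
ss_between = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))
ss_within = sum(sum((v - m) ** 2 for v in s) for s, m in zip(samples, means))

# Under H0 and normality, F follows an F(P-1, n-P) distribution
F = (ss_between / (P - 1)) / (ss_within / (n - P))
print(F)
```

Large values of F indicate that the between-samples variability is too large compared with the within-samples variability, which is evidence against the equality of the means.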

  • Populations, Variables and Statistical Techniques

    X

    (X (1) , X (2))

    (X (1) , X (2) , ..., X (k ))

    (X (1) , X (2) , ..., X (k ) ,...)

    Univariate Statistical Methods: descriptive statistics, statistical inference, etc.

    Bivariate Statistical Methods: descriptive statistics, statistical inference, simple regression, independence, etc.

    Multivariate Statistical Methods: descriptive statistics, statistical inference, multiple regression, independence, principal components, etc.

    Infinite-Dimensional Statistical Methods: discrete- and continuous-time stochastic processes, random functions, descriptive statistics, statistical inference, independence, etc.

    One Population, (Quantitative) Variables

    Cases 81

    Several Populations, One (Quantitative) Variable

    X, Y: comparison of two populations

    X^(1), X^(2), ..., X^(P): comparison of P populations: ANOVA, homogeneity hypothesis tests

  • Number of populations | Number of variables | Number of data | Quantity of interest | Knowledge about other parameters | Statistic T

    1 (normal)    | 1 | any n           | μ         | σ² known [23], [24], [25]
                  |   |                 | μ         | σ² unknown [26]
                  |   |                 | σ²        | μ known [27]
                  |   |                 | σ²        | μ unknown [28]
    1 (any)       | 1 | n large         | μ         | σ² known or unknown [29], [30], [31]
    1 (Bernoulli) | 1 | n large         | p         | [32], [33]
    2 (normal)    | 1 | any n           | μ_X − μ_Y | σ_X² and σ_Y² known [34], [35]
                  |   |                 | μ_X − μ_Y | σ_X² and σ_Y² unknown [36], [41]
                  |   |                 | σ_X²/σ_Y² | μ_X and μ_Y known [37]
                  |   |                 | σ_X²/σ_Y² | μ_X and μ_Y unknown [38]
    2 (any)       | 1 | n_X, n_Y large  | μ_X − μ_Y | σ_X² and σ_Y² known or unknown [42], [43]
    2 (Bernoulli) | 1 | n_X, n_Y large  | p_X − p_Y | [44], [45]

    Cases 82

  • Main Cases

    Number of populations | Number of variables | Number of data | Quantity of interest | Knowledge about other parameters | Statistic T

    P (normal) | 1 | n_k large | μ_k (equality of means) | σ_k² = σ² unknown [60]

    Number of populations | Number of variables | Number of data | Quantity of interest | Statistic T

    1  | 1 | n large   | F₀(x; θ)   | [61], [68]
    1? | 1 | n_k large | F(x|S)     | [64], [71]
    1  | 2 | n large   | f(x, y; θ) | [66]

    Cases 83

  • Probability distributions random variables X and Y can follow in this subject: N(μ, σ²) (any n), Bern(p) (large n, > 30), Bin(n, p), P(λ), ...

    Good news: two possible situations: normal populations or many data.

    Probability distributions statistics T in the previous tables can follow: N(0,1), t_ν, χ²_ν, F_{ν₁,ν₂}

    Good news: we need only the probability tables of these four cases.

    Main Cases

    Cases 84
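As an illustration of how a probability table is used, the N(0,1) case can be reproduced with the Python standard library (used here only as an aside; the t, χ² and F quantiles would come from printed tables or a statistical package such as scipy.stats):

```python
from statistics import NormalDist

# Quantile lookup that replaces the N(0,1) probability table:
z = NormalDist().inv_cdf(0.975)   # upper quantile for a 95% two-sided interval
p = NormalDist().cdf(1.96)        # probability below 1.96

print(round(z, 3), round(p, 3))   # prints: 1.96 0.975
```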

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    85

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields
    A Brief Guide for Students

  • [1] Real-world problem: identify the quantities, the assumptions or hypotheses, and the main question [Economics, Business Administration, Finance, etc.]

    [2] Translation into the mathematical language

    [3] Design of the whole process: the number of data needed, how to obtain them while guaranteeing the representativeness, and how to use these data [sampling process, minimum sample size, steps to solve exercises and problems]

    [4] Theoretical calculations: e.g. the inferential methods we are learning [point estimations, confidence intervals, hypothesis tests]

    [5] Data obtainment: collection of real data or generation of simulated data [others' real data and simulated data]

    [6] Analysis of data: characteristics, erroneous data, missing data, outliers, treatment (e.g. removal of the units of measurement) [descriptive statistics, standardization]

    [7] Use of the data: with the theoretical expressions [substitution into formulas]

    [8] Statistical interpretation: within the statistical framework [including: standard error, confidence, significance, types of error]

    [9] Solution or answer: based on the interpretation of the results within the framework of the real-world problem [from the mathematical language to the real world]

    Statistical Study (Here, the contents of the subject are in this color.)

    Statistical Studies 86

  • Our Study

    [3] and [4] We mention how to calculate the number of data needed (minimum sample size) in simple cases, as well as the basic ideas on sampling (simple random sampling). We learn some inferential methods (point estimations, confidence intervals, hypothesis tests) and whether they can be applied (assumptions or hypotheses). Besides, we mention the main theoretical results supporting them (theorems). Finally, we also design ways of solving the exercises and problems (steps).

    [1], [2], [5], [7], [8], and [9] We apply the methods to real-world problems (Economics, Business Administration, Finance, etc.).

    [6] We do not deal with analyses of data directly, although we frequently standardize to use some statistics T.

    In General

    It is especially interesting to highlight:

    [1] and [2] To be allowed to use the theorems behind the T's, the assumptions or hypotheses must be fulfilled. For any statistical study to be useful, the real-world problem must be well-stated (including the assumptions) and properly translated into the mathematical language.

    [7] To base the results upon reliable data and formulas, we must pay attention to the values the statistics T take, but also to the values taken by the terms they are made of.

    [8] To interpret the results statistically:
    In point estimations, we must pay attention to the estimates but also to the standard error.
    In confidence intervals, we must pay attention to the endpoints but also to the confidence.
    In hypothesis tests, we must pay attention to the decision but also to the power function.

    Statistical Studies 87

  • The usual steps of a statistical study have already been mentioned. Even though we will focus on the inferential methods and their applications (steps 4, 7 and 8), it is worthwhile mentioning at least once the importance of the quality of all those steps.

    [1] Real-world problem: First of all, the convenience and appropriateness of the real-world problem.
    [2] Translation into the mathematical language: The translation into the mathematical language must also be correct, and this mathematization may sometimes be done with several degrees of quality.
    [3] Design of the whole process: The design of the whole statistical process determines the characteristics of the data: type of sampling and representativeness, sample size, registration...
    [4] Theoretical calculations: The statistical methods themselves: selection of the method, assumptions or hypotheses, theoretical calculations...
    [5] and [6] Data obtainment and analysis: In practice, attention must be paid to obtaining and analysing the data.
    [7] Use of the data: Data must be used in the right theoretical formulas in the right way. Usually, this is not a problem if we understood step 4 or even did those calculations ourselves.
    [8] and [9] Statistical interpretation and solution or answer: The statistical interpretation of the results and its translation into the field of the real problem are quite important, obviously.

    For the types of statistical problem we deal with, well-known statistical methods are already given (for step 4) in these slides. We will apply them either to practise their use or to solve particular real-world problems.

    Qualities
    Statistical Studies 88

  • 89

    In Statistics, results may change severely when the assumptions are actually false, another method is applied, a different certainty is considered, or the data carry no proper information (representativeness, quantity, quality, etc.). Throughout this document, we do insist on the cautions that statisticians and readers of statistical works must take in interpreting results. Even if you are not interested in statistically cooking data, you had better know the recipes... (Some of them have been included in the notes mentioned in the prologue.)

    We highlight once more the very basic points on which results are based: The data available. The assumptions. The statistical method applied, including particular details of its steps, mathematical theorems and, finally, its precision. The certainty with which the method is applied: probability, confidence or significance.

    Now, let us devote some words to what quality means in applying our methods.

    Quality of Statistical Methods

    We would like, though it is not possible in Statistics, that for any sample X:

    The point estimator always provides the true value. The interval always contains the true value. The test always provides the right decision.

    Qualities
    Statistical Studies

    P(θ̂(X) = θ) = 1, ∀X

    P(θ ∈ I(X)) = 1, ∀X

    P({Rc | H₀} ∪ {Ra | H₁}) = 0, ∀X
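That these ideal properties cannot hold can be seen empirically; a sketch (the normal population with known σ and the simulation settings are assumptions) estimating the coverage of a 95% confidence interval for μ, which is close to 0.95 but never 1:

```python
import math
import random
from statistics import NormalDist

random.seed(2)

# Simulated situation: the true mu is known, so we can check the interval
mu, sigma, n = 0.0, 1.0, 30
z = NormalDist().inv_cdf(0.975)          # N(0,1) quantile
half = z * sigma / math.sqrt(n)          # half-length of the interval

hits = 0
reps = 2000
for _ in range(reps):
    x_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if x_bar - half <= mu <= x_bar + half:
        hits += 1

coverage = hits / reps
print(coverage)   # close to 0.95, but not 1
```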

  • 90

    In studying a population quantity, say θ, all methods work with estimations: values calculated from the data that are expected to be close to the real value of θ (only in simulated situations, e.g. in exercises or practicals, is θ known).

    Estimations are provided by an estimator, say θ̂. Each time data are substituted in the estimator, the value provided is different (we have already talked about randomness). Then, how different do these values tend to be? How do we measure the error and the quality of an estimator? For confidence intervals, formulas will also provide different endpoints for different data. Finally, in testing hypotheses, the decisions are based on values that are also different each time.

    Under good conditions, the essential information in different samples is quite similar even if most of their values are different. Either way, the knowledge about the possible samples or the possible values of the estimator, plus the knowledge about their probabilities, allows us to have a reference and hence to measure how similar or different the results tend to be.

    Errors

    For a particular sample, we can talk about an estimate and the classical errors (we simplify the notation of the estimator):

    θ̂ = θ̂(x₁, ..., xₙ)

    Absolute error: θ̂ − θ
    Relative error: (θ̂ − θ)/θ

    Qualities
    Statistical Studies

    The sign can be removed from the previous quantities by considering the absolute value or the square. These errors appear in concepts like bias, mean square error, consistency, etc.
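A minimal numeric illustration of the classical errors (the values of θ and θ̂ are hypothetical, as in a simulated situation where θ is known):

```python
# Hypothetical true value and estimate from a particular sample
theta = 4.0
theta_hat = 4.3

abs_error = theta_hat - theta                 # absolute error
rel_error = (theta_hat - theta) / theta       # relative error (dimensionless)

# Removing the sign, with the absolute value or the square:
abs_error_unsigned = abs(abs_error)
abs_error_squared = abs_error ** 2

print(abs_error, rel_error, abs_error_unsigned, abs_error_squared)
```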

  • 91

    For any sample, we must talk about an estimator and consider that the previous errors are also random, so they can take different values with some probabilities (these are the probability distributions explaining their behaviour: the concept of sampling distribution that we will study). Thus, in Statistics we cannot usually talk about the classical errors but about concepts (positive, usually) related to the probability distributions of the estimator itself or of the classical errors, namely:

    In the literature, the word precision can refer to any of these concepts and even some others (e.g. the variance of the estimator but also the inverse of the variance of the estimator). The probability that appears in the two probabilistic expressions is termed confidence, and it is a measure of the strength with which the bound is ensured.

    Due to the purpose of confidence intervals and hypothesis tests, the quality is measured (although the quantities above can be involved in the process) through the length and the confidence, for the former, and through the probability of making wrong decisions for different real values of θ, for the latter.

    We will learn how to interpret the different measures of quality.

    θ̂ = θ̂(X₁, ..., Xₙ)

    σ_θ̂ = √Var(θ̂), with Var(θ̂) = E([θ̂ − E(θ̂)]²)
    MSE(θ̂) = E([θ̂ − θ]²)
    CV(θ̂) = σ_θ̂ / |E(θ̂)|
    E(|θ̂ − θ|)
    E such that P(|θ̂ − E(θ̂)| ≤ E) is high
    E such that P(|θ̂ − θ| ≤ E) is high

    Note: Mathematically, an expression like |θ̂ − θ| ≤ E can also exist for θ̂ random, though such deterministic bounds are not frequent or useful.

    Qualities
    Statistical Studies

    The interpretation depends on whether they compare the estimator with the true value or not, whether they involve a deterministic or a probabilistic bound, whether they penalize large differences (e.g. using an exponent), and whether they are dimensionless or not, etc.
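A Monte Carlo sketch (the simulation settings are assumptions) of how bias, variance and mean square error measure the quality of an estimator; here the sample variance s² with denominator n, a biased estimator of σ²:

```python
import random

random.seed(3)

# Simulated situation: sigma^2 is known, so bias and MSE can be estimated
mu, sigma2, n, reps = 0.0, 1.0, 10, 5000

estimates = []
for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    m = sum(x) / n
    estimates.append(sum((v - m) ** 2 for v in x) / n)   # s2, denominator n

mean_est = sum(estimates) / reps
bias = mean_est - sigma2                                  # approx -sigma2/n = -0.1
var = sum((e - mean_est) ** 2 for e in estimates) / reps
mse = sum((e - sigma2) ** 2 for e in estimates) / reps    # MSE = var + bias^2

print(bias, var, mse)
```

The well-known decomposition MSE = Var + bias² holds exactly for the empirical quantities computed above, which makes a convenient sanity check.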

  • On the Populations
    How many populations are there? Are their probability distributions known?

    On the Samples
    If populations are not normally distributed, are the sample sizes large enough to apply asymptotic results? Do we know the data themselves, or only some quantities calculated from them?

    On the Assumptions
    What is supposed to be true? Does it seem reasonable? Do we need to prove it? Should it be checked for the populations: the random character, the independence of the populations, the goodness-of-fit to the supposed models, the homogeneity between the populations, et cetera? Should it be checked for the samples: the within-sample randomness and independence, the between-samples independence, et cetera? Are there other assumptions (neither mathematical nor statistical)?

    On the Statistical Problem
    What are the quantities to be studied statistically?

    Useful Questions

    Statistical Studies 92

  • Concretely, what is the statistical problem: point estimation, confidence interval, hypothesis test, etc.?

    On the Statistical Tools
    Which are the estimators, the statistics and the methods that will be applied?

    On the Quantities
    Which are the units of measurement? Are all the units equal? How large are the magnitudes? Do they seem reasonable? Are all of them coherent (variability is positive, probabilities and relative frequencies are between 0 and 1, etc.)?

    On the Interpretation
    What is the statistical interpretation of the solution? How is the statistical solution interpreted in the framework of the problem we are working on? Do the qualitative results seem reasonable (as expected)? Do the quantities seem reasonable (signs, order of magnitude, etc.)?

    Useful Questions

    Statistical Studies 93

  • Sections

    ( Introduction: Basic Concepts )

    Inference Theory

    Types of Problem

    Statistics and Estimators

    Cases

    Statistical Studies

    Use of T's

    94

    ( Appendixes: Practicals)

    Statistical Inference in Other Fields
    A Brief Guide for Students

  • Main statistics are summarized in tables

    The measures μ = E(X) and σ² = Var(X) are two main moments of the probability distribution of X. Any variable of a simple random sample follows this distribution too (it is a copy), and the sample is used through statistics T, which explains why the μ and σ² of X also appear in the expressions of E(T) and Var(T). Statistics for nonparametric methods (to study characteristics of the population distribution other than the mean or the variance) are also tabulated.

    Use of T's 95

    Population information
    Number of populations: 1 or 2
    Type of population: normal (any n), any (big n) or Bernoulli (big n)

    Sample information

    Inferential tool

    Statistic with which the sample information is used in a proper way to answer the statistical question. A theoretical result (theorem) tells us its distribution, which we use to calculate probabilities or find quantiles.

    X̄ is an estimator of μ

    Parameter on which the statistical question is based

    Knowledge about the other parameter of the distribution
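The remark above, that the μ and σ² of X reappear in the expressions of E(T) and Var(T), can be sketched by simulation; for T = X̄, E(T) = μ and Var(T) = σ²/n (the settings below are hypothetical):

```python
import random

random.seed(4)

# Hypothetical simulated population: mu and sigma are assumptions
mu, sigma, n, reps = 3.0, 2.0, 25, 4000

means = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(x) / n)                  # one realization of T = X_bar

e_t = sum(means) / reps                       # empirical E(T)
var_t = sum((m - e_t) ** 2 for m in means) / reps   # empirical Var(T)

print(e_t, var_t)   # close to mu = 3 and sigma^2/n = 0.16
```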

  • They can be used under some assumptions or hypotheses, and their use is not justified otherwise.

    To use them, we need to make them appear in the expressions we are working with.

    Statistics T are mathematical theorems comparing population and sample information, and, therefore, allowing us to quantify our statistical statements and answers.

    How T is Usually UsedWhat T's do

    96Use of T's

    In the following slides we use one-population cases and easy questions because they are easier to understand at the beginning. You need not try to understand all the details the first time you read them; come back to these slides while preparing the other contents.

  • 1. Select T
    By reading the statement, we identify:

    The main characteristics of the population: assumptions or hypotheses, knowledge about the population distribution (measures, parameters, etc.)

    The main characteristics of the sample: type of sampling, quality of data, quantity of data (size n).

    The statistical question: type of problem, translation into the mathematical language, quantities involved (e.g. estimators)

    2. Rewrite the Question
    The question is usually posed in terms of estimators. Since only in a few cases can we know their sampling distribution, we need to rewrite the event so as to make the statistic T appear. In rewriting the question or event, we must take into account that:

    There