4
MTH 4106 Introduction to Statistics Notes 1 Spring Semester 2015 What is Statistics about? We collect data, then analyse the data, and then interpret the results, to find out about real-world phenomena. There is always variability in the data: we need to extract meaningful patterns. Because of the variability, our conclusions cannot be certain. We need to quantify the uncertainty, so that we can make definite decisions and judge how likely it is that we are right. Here are some areas of application, with some typical problems. Agriculture Which varieties of wheat make the best bread? Manufacturing Can we make a cheaper detergent that is just as effective as the current one? Health Does an aspirin a day protect against stroke? If so, are there any side-effects? Education What is the best way to teach young children mental arith- metic? Biology How does biodiversity affect the environment? Social science Do people live in better houses than they did 20 years ago? Economics How is the credit crunch affecting food prices? Market research What sort of advertising campaign is most effective? Environmental studies Are people who live near mobile-phone masts more likely to get cancer? Meteorology Is global warming a reality? Psychology Are shyness and loneliness related? Before starting any of these investigations, we need to stop and ask: What do we want to investigate? What should we measure? How should we measure it? 1

1. What is Statistics About

Embed Size (px)

DESCRIPTION

stats

Citation preview

Page 1: 1. What is Statistics About

MTH 4106 Introduction to Statistics

Notes 1 Spring Semester 2015

What is Statistics about?

We collect data, then analyse the data, and then interpret the results, to find out aboutreal-world phenomena. There is always variability in the data: we need to extractmeaningful patterns. Because of the variability, our conclusions cannot be certain. Weneed to quantify the uncertainty, so that we can make definitedecisions and judge howlikely it is that we are right.

Here are some areas of application, with some typical problems.Agriculture Which varieties of wheat make the best bread?Manufacturing Can we make a cheaper detergent that is just aseffective

as the current one?Health Does an aspirin a day protect against stroke? If so, are

there any side-effects?Education What is the best way to teach young children mentalarith-

metic?Biology How does biodiversity affect the environment?Social science Do people live in better houses than they did 20 years

ago?Economics How is the credit crunch affecting food prices?Market research What sort of advertising campaign is most effective?Environmental studies Are people who live near mobile-phone masts more likely

to get cancer?Meteorology Is global warming a reality?Psychology Are shyness and loneliness related?

Before starting any of these investigations, we need to stopand ask:

What do we want to investigate?What should we measure?How should we measure it?

1

Page 2: 1. What is Statistics About

Populations and Samples

When we carry out a statistical investigation, we want to findout about a population.

Definition A population is the collection of items under discussion. It may be finiteor infinite; it may be real or hypothetical.

Sometimes, although we have atarget population in mind, thestudy populationthat we can actually find out information about may be different.

We are interested in measuring one or morevariables for the members of thepopulation, but to record observations for everyone would be costly. The governmentcarries out such acensus of the population every ten years, but also carries out regularsurveys based onsamples of a few thousand.

Definition A sample is a subset of a population.

The sample should be chosen to be representative of the population, because weusually want to draw conclusions orinferences about the population based on thesample. Samples will vary and the question of whether the data in the sample arecompatible with hypotheses we may have about the populationwill be consideredboth in this course and inStatistical Methods.

For each member of the sample, we will measure one (or more) random variables.We usually assume something about the distribution of the random variables. Forexample, if our data were counts of radioactive particles from a sample of radioactivesources manufactured to have the same mean, it would be reasonable to assume that

Xi ∼ Poisson(λ).

This assumption is called amodel, andλ is called aparameter.We will not concern ourselves much with the mechanics of how the sample is

chosen, but the following examples give you some idea of the sorts of problems:

(a) A city engineer wants to estimate the average weekly water consumption forsingle-family dwellings in the city.

The population is single-family dwellings in the city. The variable that we wantto measure is water consumption. To collect a sample if the dwellings havewater meters, it might be best to get lists of dwellings and annual usage directlyfrom the water company. If not, then the local authority should have lists ofaddresses which can be sampled from. Note that we should collect data throughthe year, as water consumption will be seasonal. Note also that, if there is nowater meter, measuring how much water the household uses maybe problem-atic.

2

Page 3: 1. What is Statistics About

(b) A political scientist wants to determine if a majority ofvoters favour an electedHouse of Lords.

The population is voters in the UK. Electoral rolls provide alist of those eligibleto vote. What we want to measure is their opinion on this issueusing a neutralquestion. (It would be easy to bias the response by asking a leading question.)We could choose a sample using the electoral roll, and then ask the question bypost, on the telephone or face to face, but all of these methods have problems ofnon-response and/or cost.

(c) A medical scientist wants to estimate the average lengthof time until the recur-rence of a certain disease.

The population is people who are suffering from this diseaseor have done in thepast. What we want to measure are the dates of the last bout of disease and thenew bout of disease. We could take a sample of patients suffering the diseasenow and follow them until they have another bout. This may be too slow ifthe disease does not recur often. Alternatively, we could use medical recordsof people who suffered the disease in one or more hospitals, but records can bewrong and there may be biases introduced.

(d) An electrical engineer wants to determine if the averagelength of life of transis-tors of a certain type is greater than 5,000 hours.

The population is transistors of this type. We want to recordthe length of timeto failure by putting a sample of transistors on test and recording when they fail.Note that, for such experiments where the items under test are very reliable, itmay be necessary to use an “accelerated” test where we subject the items tohigher currents than usual, although this might introduce biases.

In other parts of the course, we may not emphasise the underlying population orexactly how we collect a sample, but remember that these questions have had to beconsidered.

Three methods of collecting data

1) Take asample from apopulation.

Do we ask questions (in which case, how do we word the questionnaire?), ortake “objective” measurements, such as blood pressure?

This is called asurvey. If the sample is the whole population, it is called acensus.

3

Page 4: 1. What is Statistics About

2) Design anexperiment.

This means that we apply different treatments to different experimental units,and then measure something to see if there is a difference between the treat-ments.

How do we choose the treatments?

How do we choose the experimental units?

How do we decide who or what is given which treatment?

(See the courseDesign of Experiments.)

3) If it is impractical or unethical to impose our choice of treatments, we may doanobservational study. We might compare the effect of things that people canchange themselves (for example, diet, or whether they go to the gym), or thingsthat they cannot change (such as height, or place of birth).

Some practical examples

1. The BBC wants to know how many people watch each of its programmes.

Population = all people in the UK

Sample = panel of people, chosen to be representative.

Each member of the panel keeps a diary recording all of their TV viewing fora week, and then sends it to the BBC. The BBC uses these data to estimate thetotal number of people who watched each programme. This is a survey.

2. Health researchers want to know which lifestyle factors affect the chances ofgetting various diseases. The UK BioBank has recently recruited 500,000 vol-unteers. The UK Biobank people collect some information now(for example,“Do you drink full-cream milk?”), and then follow the person’s medical recordsuntil they die, recording which diseases they get. They willthen be able to testhypotheses such as “If you cycle daily, you are less likely tohave a stroke”. Thisis an observational study.

3. A marine engineer wants to know if a new sort of paint protects pier supportsfrom corrosion. He paints ten metal beams, and leaves a further ten beamsunpainted (why?). He puts all of the beams in a tank of sea water for threemonths, and then measures the amount of corrosion in each beam. This is anexperiment.

4