Descriptive Statistics F. Farrokhyar, MPhil, PhD, PDoc Department of Surgery Department of Clinical Epidemiology and Biostatistics March 18, 2009


Page 1: Descriptive Statistics

F. Farrokhyar, MPhil, PhD, PDoc

Department of Surgery

Department of Clinical Epidemiology and Biostatistics

March 18, 2009

Page 2: Objectives

To understand and recognize different types of variables

To learn how to explore your data

◙ How to display data with numbers and tables

◙ How to display data using graphs

To understand the fundamental concept of variability

To learn the notion of the distribution of a variable

Page 3: Why and how are statistics relevant to medicine?

Prevention – What causes a disease?

Diagnosis – What symptoms and signs do patients with a given disease present with?

Treatment – What treatments are effective for a given disease and for which patients?

Prognosis – How will specific patients with a given disease fare in the long term?

Page 4: Statistics – Why do we need it?

Page 5

BA E W

D S A Q PB B W E O N F

O H E E R D T TY E D T E Q O N E G G O L

T S D G F E W G E G G V B A Y A O E E D Y H E J U E G D

E T E W W E T H E F E O P L U M R

HOW MANY ‘E’s?

Page 6: Descriptive and Inferential statistics?

Descriptive statistics are concerned with the presentation, organization, and summarization of data.

Inferential statistics allow us to generalize from a sample to a larger group of subjects.

Page 7: What is data?

Data is collected for some purpose, and each piece of collected information has a meaning in some context.

Data is a set of information or observations about a group of individuals or subjects.

This information is organized in the form of variables.

A variable is any characteristic of a person or subject that can be measured or categorized and whose value varies from individual to individual.

Page 8: Dependent and Independent Variables?

Independent variable

◙ Is the explanatory variable that explains the changes in the dependent variable: demographics (age, gender, height), risk factors (diabetes, CAD)

◙ Is the intervention or exposure that causes the changes in the dependent variable: drug, surgery, radiation, smoking …

Dependent variable

◙ Is the outcome of interest, which changes in response to some intervention or exposure: mortality, survival, post-op pain, quality of life, post-op complications

Page 9: Types of variables …?

Qualitative or attribute variable: nonnumeric

gender, severity of injury, type of injury, tumour grade

Categorical variables …

Quantitative variable: numeric

Discrete variable: can assume only whole numbers (number of accidents, number of injuries, pain score)

Continuous variable: may take any value within a defined range (weight, height, age, blood pressure, level of cholesterol, pain score)

Page 10: Level of measurement …

There are four levels of measurement:

◙ Nominal (qualitative/categorical)

◙ Ordinal (qualitative/categorical)

◙ Interval (quantitative/numeric)

◙ Ratio (quantitative/numeric)

Page 11: Level of measurement … cont’d

Variable type and assumptions:

◙ Nominal: named categories

◙ Ordinal: same as nominal plus ordered categories

◙ Interval: same as ordinal plus equal intervals

◙ Ratio: same as interval plus meaningful zero

Page 12: Level of measurement … cont’d

A nominal variable: consists of named categories, with no implied order among the categories.

- gender, mortality (dichotomous or binary)

- type of injury, type of fracture, blood type

An ordinal variable: consists of ordered categories, where the differences between categories cannot be considered to be equal.

- Tumour stage – I, II, III, IV; tumour grade – I, II, III, IV

- Likert scale – excellent, very good, good, fair, poor

Page 13: Level of measurement … cont’d

An interval variable: has equal distances between values with no meaningful ‘zero’ value.

- IQ test (the differences between numbers are meaningful but the ratios between them are not)

A ratio variable: has equal intervals between values and a meaningful zero point, so the ratio between two values makes sense.

- height, weight, laboratory test values, age

Page 14: For example

Primary objective: To compare the post-operative pain between laparoscopic and open surgery in patients with colorectal cancer.

Secondary objective: To compare the post-operative complications between laparoscopic and open surgery in patients with colorectal cancer.

Page 15

Independent (explanatory) variables: age, sex, pre-op pain severity

Dependent/outcome variables: changes in pain, complications

Independent (comparison) variable: type of surgery (laparoscopic or open)

Page 16: Data Editing

Validity edits: Ensure that:

◘ essential fields have been completed and there is no missing information

◘ specified units of measure have been properly used and the measurements are within the acceptable range.

Duplication edits: Ensure that each case/patient has been entered into the database only once.

Statistical edits: Identify and double-check all extreme values, suspicious data and outliers.
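
The validity and duplication edits can be sketched in a few lines of Python. This is only an illustration: the records, the field names and the acceptable ranges below are assumptions made up for the example, not part of the lecture data.

```python
from collections import Counter

# Hypothetical patient records; field names and values are assumptions.
records = [
    {"id": 1, "age": 64, "sex": "F", "pain_score": 4},
    {"id": 2, "age": 58, "sex": "M", "pain_score": None},  # missing value
    {"id": 3, "age": 640, "sex": "F", "pain_score": 7},    # age out of range
    {"id": 3, "age": 71, "sex": "M", "pain_score": 2},     # id entered twice
]

# Assumed acceptable ranges for the numeric fields.
ACCEPTABLE_RANGE = {"age": (18, 110), "pain_score": (0, 10)}


def validity_problems(record):
    """Return a list of validity problems (missing or out-of-range fields)."""
    problems = []
    for field, value in record.items():
        if value is None:
            problems.append(f"{field} is missing")
        elif field in ACCEPTABLE_RANGE:
            low, high = ACCEPTABLE_RANGE[field]
            if not low <= value <= high:
                problems.append(f"{field}={value} outside {low}-{high}")
    return problems


def duplicated_ids(rows):
    """Return the ids that were entered more than once."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, c in counts.items() if c > 1)


for row in records:
    for problem in validity_problems(row):
        print(f"record id {row['id']}: {problem}")
print("duplicated ids:", duplicated_ids(records))
```

Statistical edits (extreme values and outliers) need a rule for what counts as extreme; the 1.5 IQR criterion later in the lecture is one such rule.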

Page 17: Descriptive Statistics

… are a means of organizing and summarizing observations.

We examine variables in order to describe their main features.

These basic strategies help us organize our exploration of a set of data:

◙ Begin by examining each variable.

◙ Examine the distribution of each variable by creating frequency tables, numerical summaries and graphs.

◙ Study the relationships between the variables.

Page 18: Examining Distributions: Categorical …

Numbers

Frequencies (counts), cumulative frequencies

Relative frequencies (%), cumulative relative frequencies (%)

Graphs

Bar charts

Pie charts

Page 19: Cross-tabulation of categorical data

Severity of disease

Valid    Frequency    Percent    Valid Percent    Cumulative Percent
0        7            23.3       23.3             23.3
1        13           43.3       43.3             66.7
2        10           33.3       33.3             100.0
Total    30           100.0      100.0
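
As a rough check of how the percent and cumulative percent columns arise, the short Python sketch below recomputes them from the three counts (7, 13 and 10) in the severity-of-disease output. The variable names are illustrative only.

```python
from itertools import accumulate

# Counts taken from the severity-of-disease output (levels 0, 1, 2).
counts = {"0": 7, "1": 13, "2": 10}

total = sum(counts.values())
percents = [100 * c / total for c in counts.values()]  # relative frequencies (%)
cumulative = list(accumulate(percents))                # cumulative relative frequencies (%)

for (level, count), pct, cum in zip(counts.items(), percents, cumulative):
    print(f"{level:<6}{count:>4}{pct:>8.1f}{cum:>8.1f}")
print(f"{'Total':<6}{total:>4}{sum(percents):>8.1f}")
```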

Page 20: Cross-tabulation of categorical data

                 Type of surgery
                 Open                   Lap
Complications    Count    Column N %    Count    Column N %
No               13       86.7%         11       73.3%
Yes              2        13.3%         4        26.7%
Total            15       100.0%        15       100.0%
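
The column percentages in the complications-by-surgery output can be reproduced from the counts alone. The sketch below is one way to do it in Python; the dictionary layout and the function name are assumptions chosen for the example.

```python
# Counts taken from the complications-by-type-of-surgery output.
table = {
    "Open": {"No": 13, "Yes": 2},
    "Lap": {"No": 11, "Yes": 4},
}


def column_percents(column):
    """Convert one column of counts into percentages of the column total."""
    total = sum(column.values())
    return {row: 100 * count / total for row, count in column.items()}


for surgery, column in table.items():
    for row, pct in column_percents(column).items():
        print(f"{surgery:<5}{row:<4}{column[row]:>3}{pct:>7.1f}%")
```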

Page 21: Examining Distributions: Categorical …

Numbers

Frequencies (counts), cumulative frequencies

Relative frequencies (%), cumulative relative frequencies (%)

Graphs

Bar charts

Pie charts

Page 22: Bar Charts

Page 23: Bar Charts

Page 24: Bar charts …

A bar chart can be used to depict any level of measurement (nominal, ordinal, interval, or ratio).

It consists of a series of separated bars (vertical or horizontal), one per category.

Bars represent frequency (counts) or relative frequency (percent or proportion) of each category.

A bar chart is also useful for showing data for more than one group.

Page 25: Pie Charts

Page 26: Pie charts …

Used primarily for nominal and ordinal data.

Used to display relative frequency distribution.

The circle is divided proportionally using relative frequency of each category.

A pie chart is useful for showing data for one group, but it is not suitable for graphic illustration of two or more groups.

Page 27: Examining Distributions: Quantitative …

Numbers

Measures of central tendency – mean, median, mode

Measures of variation around mean – variance, standard deviation, standard error of mean

Measures of variation around median – percentiles, quintiles, quartiles

Graphs

Histograms

The five-number summary: box plots

Page 28: Measures of central tendency

Mean: sum of observations divided by number of observations

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Median: the midpoint of a distribution after arranging all observations in order of size, from smallest to largest.

Mode: most frequent value – the highest peak
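
A minimal Python sketch of the three measures, using the standard `statistics` module. The ages below are hypothetical values chosen for illustration, not the data set used in the lecture.

```python
import statistics

# Hypothetical ages (years); not the lecture data.
ages = [44, 58, 58, 61, 64, 66, 70, 75, 82]

mean_age = statistics.mean(ages)      # sum of observations / number of observations
median_age = statistics.median(ages)  # middle value after sorting
mode_age = statistics.mode(ages)      # most frequent value

print(f"mean={mean_age:.2f}  median={median_age}  mode={mode_age}")
```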

Page 29: Properties of mean …

It is used for interval or ratio data.

A set of data has only one mean.

All values are included in the computation.

It is the only measure of central tendency where the sum of deviations of each value from the mean will always be zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

The mean is a useful measure for comparing two or more sets of data.

The mean is sensitive to extreme values.

Page 30: Properties of median …

It is used for interval or ratio data.

There is a unique median for each data set.

The median is not necessarily equal to one of the sample values.

It is resistant (insensitive) toward extreme values.

It is useful for summarising skewed data.
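
Two of the properties listed for the mean and the median can be seen on a small example: deviations from the mean sum to zero, and a single extreme value moves the mean much more than the median. The lengths of stay below are hypothetical.

```python
import statistics

# Hypothetical post-op lengths of stay (days).
stays = [3, 4, 4, 5, 6]
stays_with_extreme = stays + [40]  # add one extreme value

# Deviations from the mean sum to zero (up to floating-point rounding).
mean_stay = statistics.mean(stays)
sum_of_deviations = sum(x - mean_stay for x in stays)

print("sum of deviations:", sum_of_deviations)
print("mean  :", statistics.mean(stays), "->", statistics.mean(stays_with_extreme))
print("median:", statistics.median(stays), "->", statistics.median(stays_with_extreme))
```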

Page 31: Measures of variation around mean

Variance: the average of the squares of the deviations of the data from their mean

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard deviation: square root of variance

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Standard error:

$$\mathrm{s.e.} = \frac{\sigma}{\sqrt{n}}$$
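
The three formulas can be sketched in Python as follows. The sample is hypothetical; `statistics.variance` and `statistics.stdev` use the n − 1 denominator shown above.

```python
import math
import statistics

# Hypothetical sample of ages (years).
ages = [44, 58, 58, 61, 64, 66, 70, 75, 82]
n = len(ages)

variance = statistics.variance(ages)  # sample variance, divides by n - 1
std_dev = statistics.stdev(ages)      # square root of the sample variance
std_error = std_dev / math.sqrt(n)    # standard error of the mean

print(f"variance={variance:.2f}  s.d.={std_dev:.2f}  s.e.={std_error:.2f}")
```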

Page 32: Properties of variance …

All values are used in the calculation.

The units are not the same as those of the data; they are the square of the original units.

Page 33: Properties of standard deviation …

The units are the same as those of the data.

It is used for the Empirical Rule.

For any symmetrical, bell-shaped (approximately normal) distribution:

◘ About 68% of the observations will lie within 1 s.d. of the mean.

◘ About 95% of the observations will lie within 2 s.d. of the mean.

◘ About 99.7% of the observations will lie within 3 s.d. of the mean.
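
For an exactly normal distribution these proportions can be computed directly. A short sketch using `statistics.NormalDist` from the Python standard library (the helper name `coverage_within` is made up for this example):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage_within(k):
    """Probability that a normal observation lies within k s.d. of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} s.d.: {100 * coverage_within(k):.1f}%")
```

Real data are only approximately normal, so the observed proportions will usually differ somewhat from these values.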

Page 34: The Empirical Rule

Page 35: Measures of variation around median

Percentiles:

Arrange the observations from smallest to largest.

Divide into 100 equal parts.

For example, the 5th percentile of a distribution is the value below which 5% of the observations fall and above which 95% fall.

Quartiles: 25th, 50th and 75th percentiles

Quintiles: 20th, 40th, 60th, and 80th percentiles

Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles
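
Quartiles, quintiles and deciles are all cut points of the sorted data, so one function can produce them. A minimal sketch with `statistics.quantiles` on hypothetical ages; note that statistical packages use slightly different interpolation rules, so results can differ a little between programs.

```python
import statistics

# Hypothetical ages (years); not the lecture data.
ages = [44, 52, 58, 58, 61, 63, 64, 66, 68, 70, 73, 75, 82]

quartiles = statistics.quantiles(ages, n=4)   # 25th, 50th, 75th percentiles
quintiles = statistics.quantiles(ages, n=5)   # 20th, 40th, 60th, 80th percentiles
deciles = statistics.quantiles(ages, n=10)    # 10th, 20th, ..., 90th percentiles

print("quartiles:", quartiles)
print("quintiles:", quintiles)
print("deciles  :", deciles)
```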

Page 36: Statistics

Age

N                     Valid      30
                      Missing    0
Mean                             63.87
Std. Error of Mean               1.494
Median                           64.00
Mode                             58(a)
Std. Deviation                   8.182
Variance                         66.947
Range                            38
Minimum                          44
Maximum                          82
Percentiles           25         58.75
                      50         64.00
                      75         69.50

a. Multiple modes exist. The smallest value is shown.
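
Several entries in this output are linked by the formulas given earlier, which allows a quick consistency check. The sketch below uses only the reported values; small differences (for example in the variance) are expected because the reported standard deviation is rounded.

```python
import math

# Values reported in the age output.
n = 30
std_dev = 8.182
minimum, maximum = 44, 82

std_error = std_dev / math.sqrt(n)  # reported as 1.494
variance = std_dev ** 2             # reported as 66.947
value_range = maximum - minimum     # reported as 38

print(f"s.e. = {std_error:.3f}")
print(f"variance = {variance:.3f}")
print(f"range = {value_range}")
```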

Page 37: Examining Distributions: Quantitative …

Numbers

Measures of central tendency; mean, median, mode

Measures of variation around mean – variance, standard deviation, standard error of mean

Measures of variation around median – percentiles, quintiles, quartiles

Graphs

Histograms

The five-number summary: box plot

Page 38: Histogram

Outliers??

Page 39: Histograms …

Used for interval and ratio data.

A histogram is a graph in which each bar spans a range of values on the horizontal axis (the interval width), and the vertical axis represents the frequency of each interval.

There are no spaces between bars.

A histogram is useful for graphic illustration of one group.
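
The counting behind a histogram can be sketched without any plotting library: assign each value to an interval and count. The ages and the interval width of 10 years are assumptions for illustration.

```python
from collections import Counter

# Hypothetical ages (years); the interval width of 10 is an arbitrary choice.
ages = [44, 52, 58, 58, 61, 63, 64, 66, 68, 70, 73, 75, 82]
interval_width = 10


def histogram_counts(values, width):
    """Count how many values fall in each interval [start, start + width)."""
    counts = Counter((value // width) * width for value in values)
    return dict(sorted(counts.items()))


# Text rendering: one row per interval, one '#' per observation.
for start, frequency in histogram_counts(ages, interval_width).items():
    label = f"{start}-{start + interval_width - 1}"
    print(f"{label:<7}{'#' * frequency}  ({frequency})")
```

Changing `interval_width` changes the shape of the picture, which is one reason to try more than one width when exploring data.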

Page 40: Box plot: 5 – number summary

[Figure: annotated box plot. Labels: whiskers, Q1, median/Q2, Q3, inner fences, outliers, 1st and 100th.]

Range = Max – Min

IQR = Q3 – Q1

Page 41: Box plot of change in pain score

Page 42: Box Plots …

Used for interval and ratio data.

Uses the five-number summary measures: median, Q1, Q3, minimum and maximum.

It is useful in detecting outliers.

It is useful to illustrate the distribution of more than one group.
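
A minimal sketch of the five-number summary for two groups, which is the information a side-by-side box plot displays. The scores are hypothetical, and the quartiles come from `statistics.quantiles`, whose default interpolation rule may differ slightly from other software.

```python
import statistics

# Hypothetical change-in-pain scores for two groups (not the lecture data).
groups = {
    "Open": [1, 2, 2, 3, 3, 4, 5],
    "Lap": [2, 3, 4, 4, 5, 6, 6],
}


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


for name, values in groups.items():
    print(name, five_number_summary(values))
```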

Page 43: What are outliers … ?

Outliers are extreme data values that fall outside the overall distribution of the data set.

Page 44: Box plot: 5 – number summary

[Figure: annotated box plot. Labels: whiskers, Q1, median/Q2, Q3, inner fences, 1st and 100th.]

IQR = Q3 – Q1

Page 45: 1.5 IQR Criterion for Outliers

Interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 – Q1

From data:

Q1 = 59 yrs, Q3 = 70 yrs

IQR = 70 – 59 = 11

1.5 × IQR = 1.5 × 11 = 16.5

Q1 – 1.5 × IQR = 59 – 16.5 = 42.5

Q3 + 1.5 × IQR = 70 + 16.5 = 86.5

From data: Min = 44 and Max = 82, so every observation lies between the fences (42.5 and 86.5) and none is flagged as an outlier.
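
The same arithmetic as a short Python sketch, using the quartiles, minimum and maximum quoted on this slide. The helper `is_outlier` is a name made up for the example.

```python
# Quartiles, minimum and maximum as quoted on the slide (age in years).
q1, q3 = 59, 70
minimum, maximum = 44, 82

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value):
    """Flag a value that falls outside the 1.5 x IQR fences."""
    return value < lower_fence or value > upper_fence


print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("min is outlier:", is_outlier(minimum))
print("max is outlier:", is_outlier(maximum))
```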

Page 46: Properties of quartiles, quintiles…

They are used for interval or ratio data.

They are resistant (insensitive) to extreme values.

They are useful for summarising skewed data.

Page 47: How to deal with skewed data

Transform the data:

◙ Square/square root – (Poisson) count data

◙ Log(x) or ln(x) – data is skewed toward right

◙ Reciprocal (1/X) – data is skewed toward left

Transformation:

◙ Makes skewed data more symmetric

◙ Makes distribution more normal

◙ Stabilizes variability

◙ Linearizes a relationship between two or more variables

Show summary statistics in the original units but analyse the transformed data.
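
To illustrate the log transformation on right-skewed data, the sketch below compares a simple moment-based skewness measure before and after taking logs. The values and the `skewness` helper are assumptions for this example; they are not from the lecture.

```python
import math
import statistics

# Hypothetical right-skewed values (for example, lengths of stay in days).
values = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]


def skewness(data):
    """Moment-based skewness: positive for a long right tail, near 0 if symmetric."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)


log_values = [math.log(x) for x in values]  # natural log; requires positive values

print(f"skewness before log: {skewness(values):.2f}")
print(f"skewness after log : {skewness(log_values):.2f}")
print(f"original units: mean {statistics.fmean(values):.1f}, median {statistics.median(values)}")
```

The log is only defined for positive values, so data containing zeros or negative numbers would need a different approach.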

Page 48: Summary of what we have learned …

Always plot your data: make a graph, e.g. a histogram or box plot.

Look for the overall pattern (shape, centre and spread) and for striking deviations such as outliers.

Check to see if the overall pattern of the distribution can be described by a normal distribution.

If not, transform the data to make skewed data more symmetric.

Calculate an appropriate numerical summary to describe centre and spread.