Descriptive Statistics F. Farrokhyar, MPhil, PhD, PDoc Department of Surgery Department of Clinical...

Descriptive Statistics

F. Farrokhyar, MPhil, PhD, PDoc

Department of Surgery

Department of Clinical Epidemiology and Biostatistics

March 18, 2009

Objectives

To understand and recognize different types of variables

To learn how to explore your data

◙ How to display data with numbers and tables

◙ How to display data using graphs

To understand the fundamental concept of variability

To learn the notion of the distribution of a variable

Why and how are statistics relevant to medicine?

Prevention – What causes a disease?

Diagnosis – What symptoms and signs do patients with a given disease present with?

Treatment – What treatments are effective for a given disease and for which patients?

Prognosis – How will specific patients with a given disease fare in the long term?

Statistics – Why do we need it?

BA E W

D S A Q PB B W E O N F

O H E E R D T TY E D T E Q O N E G G O L

T S D G F E W G E G G V B A Y A O E E D Y H E J U E G D

E T E W W E T H E F E O P L U M R

HOW MANY ‘E’s?

Descriptive and Inferential statistics?

Descriptive statistics are concerned with the presentation, organization, and summarization of data

Inferential statistics allow us to generalize from a sample to a larger group of subjects.

What is data?

Data is collected for some purpose, and each piece of collected information has a meaning in some context.

Data is a set of information or observations about a group of individuals or subjects.

This information is organized in the form of variables.

A variable is any characteristic of a person or a subject that can be measured or categorized and its value varies from individual to individual.

Dependent and Independent Variables?

Independent variable

◙ Is the explanatory variable that explains the changes in the dependent variable: demographics (age, gender, height), risk factors (diabetes, CAD)

◙ Is the intervention or exposure that causes the changes in the dependent variable: drug, surgery, radiation, smoking …

Dependent variable

◙ Is the outcome of interest, which changes in response to some intervention or exposure: mortality, survival, post-op pain, quality of life, post-op complications

Types of variables …?

Qualitative or attribute variable: Nonnumeric

gender, severity of injury, type of injury, tumour grade

Categorical variables…

Quantitative variable: Numeric

Discrete variable can assume only whole numbers: number of accidents, number of injuries, pain score

Continuous variable may take any value within a defined range: weight, height, age, blood pressure, level of cholesterol, pain score

Level of measurement …

There are four levels of measurement:

Qualitative/Categorical

◙ Nominal

◙ Ordinal

Quantitative/Numeric

◙ Interval

◙ Ratio

Level of measurement … cont’d

Variable type: Assumptions

◙ Nominal: Named categories

◙ Ordinal: Same as nominal plus ordered categories

◙ Interval: Same as ordinal plus equal intervals

◙ Ratio: Same as interval plus meaningful zero

Level of measurement … cont’d

A nominal variable: consists of named categories, with no implied order among the categories.

- gender, mortality (dichotomous or binary)

- type of injury, type of fracture, blood type

An ordinal variable: consists of ordered categories, where the differences between categories cannot be considered to be equal.

- Tumour stage – I, II, III, IV; tumour grade – I, II, III, IV

- Likert scale – excellent, very good, good, fair, poor

Level of measurement … cont’d

An interval variable: has equal distances between values with no meaningful ‘zero’ value.

- IQ test (the differences between numbers are meaningful but the ratios between them are not)

A ratio variable: has equal intervals between values and a meaningful zero point, so the ratio between two values makes sense.

- height, weight, laboratory test values, age

For example

Primary objective: To compare the post-operative pain between laparoscopic and open surgery in patients with colorectal cancer

Secondary objective: To compare the post-operative complications between laparoscopic and open surgery in patients with colorectal cancer

Independent (Comparison) variable: Type of surgery (laparoscopic vs. open)

Independent (Explanatory) variables: Age, Sex, Pre-op pain severity

Dependent/outcome variables: Changes in pain, Complications

Data Editing

Validity edits: Ensure that:

◘ essential fields have been completed and there is no missing information

◘ specified units of measure have been properly used and the measurements are within the acceptable range.

Duplication edits: Ensure that each case/patient has been entered into the database only once.

Statistical edits: Identify and double check all the extreme values, suspicious data and outliers.
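
The three kinds of edits above can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`id`, `age`, `surgery`) and the acceptable age range are assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical patient records; field names and values are assumptions.
records = [
    {"id": 1, "age": 64, "surgery": "Lap"},
    {"id": 2, "age": None, "surgery": "Open"},  # missing essential field
    {"id": 3, "age": 640, "surgery": "Open"},   # outside the acceptable range
    {"id": 1, "age": 64, "surgery": "Lap"},     # same patient entered twice
]

AGE_RANGE = (18, 110)  # assumed acceptable range, in years


def validity_problems(rows):
    """Validity edit: list (id, reason) for missing or out-of-range ages."""
    problems = []
    for row in rows:
        if row["age"] is None:
            problems.append((row["id"], "missing age"))
        elif not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            problems.append((row["id"], "age out of range"))
    return problems


def duplicate_ids(rows):
    """Duplication edit: ids that appear more than once, sorted."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, c in counts.items() if c > 1)


print(validity_problems(records))
print(duplicate_ids(records))
```

Flagged records should be checked against the source documents rather than deleted automatically; an extreme value may be a genuine observation.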

Descriptive Statistics

… are a means of organizing and summarizing observations.

We examine variables in order to describe their main features.

These basic strategies help us organize our exploration of a set of data:

◙ Begin by examining each variable.

◙ Examine the distribution of each variable by creating frequency tables, numerical summaries and graphs.

◙ Study the relationships between the variables.

Examining Distributions: Categorical …

Numbers

Frequencies (counts), cumulative frequencies

Relative frequencies (%), cumulative relative frequencies (%)

Graphs

Bar charts

Pie charts

Cross-tabulation of categorical data

Severity of disease

Severity   Frequency   Percent   Valid Percent   Cumulative Percent
0                  7      23.3            23.3                 23.3
1                 13      43.3            43.3                 66.7
2                 10      33.3            33.3                100.0
Total             30     100.0           100.0

Complications by type of surgery

                Open                   Lap
Complications   Count   Column N %     Count   Column N %
No                 13        86.7%        11        73.3%
Yes                 2        13.3%         4        26.7%
Total              15       100.0%        15       100.0%
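
As a rough sketch of how such tables are produced, the Python below rebuilds the raw data from the counts shown above and recomputes the percentages. The function names are made up for this example, and the reading that the first pair of columns belongs to open surgery is an assumption taken from the order of the column labels.

```python
from collections import Counter

# Raw data rebuilt from the reported counts (severity 0/1/2: 7, 13, 10).
severity = [0] * 7 + [1] * 13 + [2] * 10


def frequency_table(values):
    """Rows of (category, count, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for category in sorted(counts):
        running += counts[category]
        rows.append((category,
                     counts[category],
                     round(100 * counts[category] / n, 1),
                     round(100 * running / n, 1)))
    return rows


# (type of surgery, complication) pairs rebuilt from the cross-tabulation.
pairs = ([("Open", "No")] * 13 + [("Open", "Yes")] * 2
         + [("Lap", "No")] * 11 + [("Lap", "Yes")] * 4)


def column_percent(data, group, outcome):
    """Percentage of one surgery group with the given outcome (Column N %)."""
    in_group = [o for g, o in data if g == group]
    return round(100 * in_group.count(outcome) / len(in_group), 1)


for row in frequency_table(severity):
    print(row)
print(column_percent(pairs, "Open", "Yes"), column_percent(pairs, "Lap", "Yes"))
```

Column percentages are computed within each surgery group, which is what allows the two groups to be compared directly.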

Bar charts …

A bar chart can be used to depict any level of measurement (nominal, ordinal, interval, or ratio).

It consists of a series of separated bars (vertical or horizontal), one per category.

Bars represent frequency (counts) or relative frequency (percent or proportion) of each category.

A bar chart is also useful for showing data for more than one group.

Pie charts …

Used primarily for nominal and ordinal data.

Used to display relative frequency distribution.

The circle is divided proportionally using relative frequency of each category.

A pie chart is useful for showing data for one group, but it is poorly suited to illustrating two or more groups.

Examining Distributions: Quantitative …

Numbers

Measures of central tendency – mean, median, mode

Measures of variation around mean – variance, standard deviation, standard error of mean

Measures of variation around median – percentiles, quintiles, quartiles

Graphs

Histograms

Box plots (the five-number summary)

Measures of central tendency

Mean: sum of observations divided by number of observations

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median: the midpoint of a distribution after arranging all observations in order of size, from smallest to largest.

Mode: most frequent value – the highest peak

Properties of mean …

It is used for interval or ratio data.

A set of data has only one mean.

All values are included in the computation.

It is the only measure of central tendency where the sum of deviations of each value from the mean will always be zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

The mean is a useful measure for comparing two or more sets of data.

The mean is sensitive to extreme values.

Properties of median …

It is used for interval or ratio data.

There is a unique median for each data set.

The median is not necessarily equal to one of the sample values.

It is resistant (insensitive) to extreme values.

It is useful for summarising skewed data.
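
A small Python sketch can make these properties concrete. The ages below are hypothetical values chosen for the example; the code uses only the standard `statistics` module.

```python
from statistics import mean, median

ages = [58, 60, 62, 64, 66]   # hypothetical ages, in years
with_extreme = ages + [120]   # the same data plus one extreme value

# Deviations from the mean always sum to zero.
deviation_sum = sum(x - mean(ages) for x in ages)

print("sum of deviations:", deviation_sum)
print("mean:  ", mean(ages), "->", round(mean(with_extreme), 2))
print("median:", median(ages), "->", median(with_extreme))
```

One extreme value pulls the mean up by almost ten years while the median moves by a single year, which is why the median is preferred for skewed data.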

Measures of variation around mean

Variance: the average of the squares of the deviations of the data from their mean

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard deviation: square root of variance

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Standard error:

$$s.e. = \frac{\sigma}{\sqrt{n}}$$
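
The three measures can be computed directly from their definitions. The sketch below assumes the sample form with n − 1 in the denominator; the data values and function names are made up for the example. The last line applies the standard-error formula to the standard deviation (8.182) and sample size (30) reported for Age in this deck.

```python
import math
from statistics import mean


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(sample_variance(xs))


def standard_error(xs):
    """Standard error of the mean: s.d. divided by the square root of n."""
    return sample_sd(xs) / math.sqrt(len(xs))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(sample_variance(data), sample_sd(data), standard_error(data))

# Standard error recomputed from the reported s.d. and n for Age.
se_age = 8.182 / math.sqrt(30)
print(round(se_age, 3))
```

The recomputed value agrees with the reported standard error of the mean for Age (1.494).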

Properties of variance …

All values are used in the calculation.

The units are not the same as those of the data; they are the square of the original units.

Properties of standard deviation …

The units are the same as those of the data.

It is used for the Empirical Rule.

The Empirical Rule

For any symmetrical, bell-shaped (approximately normal) distribution:

◘ About 68% of the observations will lie within 1 s. d. of the mean.

◘ About 95% of the observations will lie within 2 s. d. of the mean.

◘ About 99.7% of the observations will lie within 3 s. d. of the mean.
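
These percentages can be checked against the standard normal distribution. The sketch below assumes the data are exactly normal, which real data only approximate, and uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the function name is made up for the example.

```python
from statistics import NormalDist


def within_k_sd(k):
    """Probability that a normal observation lies within k s.d. of the mean."""
    z = NormalDist()  # standard normal: mean 0, s.d. 1
    return z.cdf(k) - z.cdf(-k)


coverage = {k: round(100 * within_k_sd(k), 1) for k in (1, 2, 3)}
print(coverage)
```

Under the normal assumption the coverage within 3 s. d. is about 99.7%. For skewed or heavy-tailed data these percentages can be noticeably different.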

Measures of variation around median

Percentiles:

Arrange the observations from smallest to largest.

Divide into 100 equal parts;

for example, the 5th percentile of a distribution is the value below which 5% of the observations fall and above which 95% fall.

Quartiles: 25th, 50th and 75th percentiles

Quintiles: 20th, 40th, 60th, and 80th percentiles

Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles
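
A minimal sketch of these cut points in Python, using `statistics.quantiles` from the standard library. The ages are hypothetical, and the function's default ("exclusive") interpolation rule is only one of several in use, so other statistical packages may return slightly different values for the same data.

```python
from statistics import median, quantiles

# Hypothetical ages, in years (not the data set summarized in this deck).
ages = [44, 52, 58, 58, 61, 64, 64, 67, 70, 75, 82]

q1, q2, q3 = quantiles(ages, n=4)      # quartiles: 25th, 50th, 75th
quintile_cuts = quantiles(ages, n=5)   # 20th, 40th, 60th, 80th
decile_cuts = quantiles(ages, n=10)    # 10th, 20th, ..., 90th

print("quartiles:", q1, q2, q3)
print("quintiles:", quintile_cuts)
print("deciles:  ", decile_cuts)
```

Dividing the data into k equal parts always needs k − 1 cut points, so there are three quartiles, four quintiles and nine deciles; the second quartile is the median.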

Statistics: Age

N   Valid              30
    Missing             0
Mean                63.87
Std. Error of Mean  1.494
Median              64.00
Mode                   58 (a)
Std. Deviation      8.182
Variance           66.947
Range                  38
Minimum                44
Maximum                82
Percentiles   25    58.75
              50    64.00
              75    69.50

a. Multiple modes exist. The smallest value is shown.

Histogram

[Figure: histogram, annotated "Outliers??"]

Histograms …

Used for interval and ratio data.

A histogram is a graph in which each bar along the horizontal axis represents a range of values (an interval), and the vertical axis represents the frequency of each interval.

There are no spaces between bars.

A histogram is useful for graphic illustration of one group.
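
The counting behind a histogram can be sketched as follows. The ages, the starting point (40) and the interval width (10 years) are all assumptions chosen for the example, and the function name is made up.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts


ages = [44, 52, 58, 58, 61, 64, 64, 67, 70, 75, 82]  # hypothetical ages
counts = histogram_counts(ages, start=40, width=10, bins=5)

# Print a simple text histogram, one row per interval.
for k, count in enumerate(counts):
    low = 40 + 10 * k
    print(f"{low}-{low + 9}: {'#' * count}")
```

The choice of interval width changes the apparent shape of the distribution, so it is worth trying more than one width.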

Box plot: 5-number summary

[Figure: box plot labelling the median (Q2), Q1, Q3, the whiskers, the inner fences and outliers, with "1st" and "100th" marked at the extremes]

Range = Max - Min

IQR = Q3 – Q1

Box plot of change in pain score

[Figure: box plot of change in pain score]

Box Plots …

Used for interval and ratio data.

Uses the five-number summary measures: median, Q1, Q3, minimum and maximum.

It is useful in detecting outliers.

It is useful to illustrate the distribution of more than one group.

What are outliers … ?

Outliers are extreme data values that fall outside the overall distribution of the data set.

1.5 IQR Criterion for Outliers

Interquartile range (IQR) is the distance between the first and third quartiles. IQR = Q3 – Q1

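From data

Q1 = 59 yrs, Q3 = 70 yrs,

IQR = 70 – 59 = 11

1.5 × IQR = 1.5 × 11 = 16.5

Lower inner fence: Q1 – 1.5 × IQR = 59 – 16.5 = 42.5

Upper inner fence: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5

From data: Min = 44 and Max = 82. Both lie within the fences (42.5 to 86.5), so no observation is flagged as an outlier.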
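
The same arithmetic can be written as a short Python sketch. The function names are made up for this example; the quartiles (59 and 70) and the minimum and maximum (44 and 82) are the values quoted above, while the second list of ages is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, q1, q3):
    """Values falling below the lower fence or above the upper fence."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


lower, upper = iqr_fences(59, 70)
print(lower, upper)

# The observed minimum and maximum lie inside the fences.
print(outliers([44, 82], 59, 70))

# Hypothetical ages of 40 and 90 would be flagged.
print(outliers([40, 44, 82, 90], 59, 70))
```

The criterion only flags values for checking; whether a flagged value is an error or a genuine extreme observation has to be decided from the source data.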

Properties of quartiles, quintiles…

They are used for interval or ratio data.

They are resistant (insensitive) to extreme values.

They are useful for summarising skewed data.

How to deal with skewed data

Transform the data:

◘ Square/square root – (Poisson) count data

◘ Log(x) or ln(x) – data is skewed toward right

◘ Reciprocal (1/X) – data is skewed toward left

Transformation:

◘ Makes skewed data more symmetric

◘ Makes distribution more normal

◘ Stabilizes variability

◘ Linearizes a relationship between two or more variables

Show summary statistics in the original units, but analyse the transformed data.
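
To illustrate the log transformation, the sketch below compares the skewness of a right-skewed variable before and after taking natural logs. The lengths of stay are hypothetical, and the moment-based skewness formula used here is one common definition; statistical packages often apply a small-sample correction and will report slightly different values.

```python
import math
from statistics import mean


def skewness(xs):
    """Moment-based skewness: mean cubed deviation / (mean squared deviation)^1.5."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


stay_days = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]  # hypothetical, skewed to the right
log_days = [math.log(x) for x in stay_days]   # natural-log transform

# Back-transform the mean of the logs to report in the original units.
geometric_mean = math.exp(mean(log_days))

print(round(skewness(stay_days), 2), round(skewness(log_days), 2))
print(round(geometric_mean, 2))
```

The transformed values are much closer to symmetric, and back-transforming the mean of the logs gives the geometric mean, one way of reporting a summary in the original units after analysing on the log scale.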

Summary of what we have learned ….

Always plot your data: make a graph, e.g. a histogram or box plot

Look for overall pattern (shape, centre and spread) and for striking deviations such as outliers

Check to see if overall pattern of distribution can be described by normal distribution.

If not, transform the data to make skewed data more symmetric

Calculate an appropriate numerical summary to describe centre and spread
