
Data & Statistic - Universiti Sains Malaysiamedic.usm.my/biostat/files/documents/basic 2016/nmy... · 2016-07-19 · • Involves collection, organization, analysis, interpretation


DESCRIPTIVE STATISTICS

Dr Najib Majdi bin Yaacob

MD, MPH, DrPH (Epidemiology) USM

Unit of Biostatistics & Research Methodology

School of Medical Sciences

Universiti Sains Malaysia.

Content

• Introduction to statistics

• Descriptive vs. inferential statistics

• Variables

• Types of variables

• Organizing and displaying data for categorical variables

• Organizing and displaying data for numerical variables

INTRODUCTION TO STATISTICS

DATA VS. STATISTIC VS. STATISTICS

Data & Statistic

• Data: A collection of items of information.

• Statistic: A summary value of some attribute of a sample, usually but not necessarily an estimator of some population parameter. It is calculated by applying a function to the values of the items of the sample.

(Porta, M. (2014). A Dictionary of Epidemiology: Oxford University Press, USA)

Statistics

• The science of collecting, summarizing, and analyzing data. Data may or may not be subject to random variation.

• The data themselves and summarizations of the data.

Porta, M. (2008). A Dictionary of Epidemiology: Oxford University Press, USA

• A branch of applied mathematics concerned with the collection and interpretation of quantitative data and the use of probability theory to estimate population parameters.

http://www.thefreedictionary.com

Example:

• Data:

ID   Gender   Height (m)
1    Male     1.67
2    Male     1.73
3    Female   1.61
4    Male     1.63
5    Female   1.57
6    Female   1.62
7    Female   1.53


Example:

• Statistic:

– 4 (57.1%) Female, 3 (42.9%) Male

– Mean height = 1.62 m

– Standard deviation of height = 0.06 m

• Statistics:

– The process of calculating the statistic: how to calculate the frequency and percentage for gender, and how to calculate the mean and standard deviation for height.
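The statistics quoted in this example can be reproduced from the seven rows of data. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative, and the standard deviation is assumed to be the sample standard deviation (divisor n − 1).

```python
from collections import Counter
from statistics import mean, stdev

# Data from the example table: (ID, gender, height in metres)
records = [
    (1, "Male", 1.67), (2, "Male", 1.73), (3, "Female", 1.61),
    (4, "Male", 1.63), (5, "Female", 1.57), (6, "Female", 1.62),
    (7, "Female", 1.53),
]

# Frequency and percentage for gender (categorical variable)
n = len(records)
gender_counts = Counter(gender for _, gender, _ in records)
gender_pct = {g: round(100 * c / n, 1) for g, c in gender_counts.items()}

# Mean and sample standard deviation for height (numerical variable)
heights = [h for _, _, h in records]
mean_height = round(mean(heights), 2)
sd_height = round(stdev(heights), 2)  # assumption: sample SD, divisor n - 1

for g, c in gender_counts.items():
    print(f"{g}: {c} ({gender_pct[g]}%)")
print(f"Mean height = {mean_height} m, SD = {sd_height} m")
```

Here the data are the seven rows, each printed number is a statistic, and the calculation itself is what the slide calls statistics.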

Why use statistics?

• Modern society is concerned with reading and writing

• Statistics is used to draw the strongest possible conclusions from a limited amount of data.

• A more thorough understanding of the research literature leads to improved patient care.

DESCRIPTIVE VS. INFERENTIAL BRANCHES OF STATISTICS

Descriptive statistics

• Describe and summarize a dataset

• Involves collection, organization, analysis, interpretation and presentation of sample data

• Can be presented in tables, graphs or narrative format

Descriptive statistics

• Purpose

– Describe the characteristics of study participants

– Understand the data

– Answer the research questions in descriptive study

– Detect outliers or extreme values

How to describe this population?

[Figure: samples are taken from the population and described]

Descriptive statistics

– Frequency distribution

– Measures of central tendency

– Measures of dispersion

– Measures of position

– Exploratory data analysis

– Measures of shape of distribution: graphs, skewness, kurtosis

Inferential statistics

– Estimation

– Hypothesis testing to reach a decision

– Parametric statistics

– Non-parametric statistics (distribution-free statistics)

– Modelling, predicting

How to draw conclusions about this population?

[Figure: samples are taken from the population and the findings are inferred to the population]

VARIABLE


Variables

• Any quantity that has different values across individuals or other study units. (Porta, M. (2014). A Dictionary of Epidemiology: Oxford University Press, USA)

• Variables

– Independent

– Dependent

Variables

• Independent variable

– A variable that is hypothesized to influence an event or state (the dependent variable)

– The independent variable is not influenced by the event but may cause (or contribute to the occurrence of) the event, or contribute to a change in the (psychological, environmental, socioeconomic) status.

Variables

• Dependent variable

– A variable whose value depends on the effect of other variable(s), the independent variable(s), in the relationship under study.

– A manifestation or outcome whose variation we seek to explain or account for by the influence of independent variables.

Variables

Effect of sunlight on plant growth

[Figure: X axis: independent variable; Y axis: dependent variable]

Variables

• Controlled variable(s)

– Everything you want to remain constant and unchanged during the study period

– Example: Investigating the effect of sunlight exposure duration (hours/day) on plant growth

• Independent variable: Duration of sunlight exposure

• Dependent variable: Plant height

• Controlled variables: type of plant, size of pot, amount of water, type of soil, etc.

TYPES OF VARIABLES

MEASUREMENT SCALE

Measurement scale

• Different types of scale are measured differently

• Knowledge about the measurement scale/data helps in deciding how to organize, analyse and present the data.

• Four fundamental scales:

– Nominal

– Ordinal

– Interval

– Ratio

Classification of data

Data

• Categorical (qualitative): Nominal, Ordinal

• Numerical (quantitative): Interval, Ratio

Less info → More info (Nominal, Ordinal, Interval, Ratio)

Categorical data: Nominal scales

• Names or categories, mutually exclusive

• Does not imply any ordering of responses

• Example;

– Sex: Male, Female

– Race: Malay, Chinese, Indian, Others

• Lowest and least informative level of measurement

Categorical data: Ordinal scales

• Names or categories that are mutually exclusive and whose order is meaningful

• Example:

– Severity: mild, moderate, severe

– Socioeconomic status: Low, Middle, High

• Limitation:

– Can't assume the differences between adjacent scale values are equal

– Can't make this assumption even if the labels are numbers


Numerical data: Interval scales

• Names or categories; the order is meaningful and the intervals are equal.

• Example;

– Fahrenheit temperature scale

– Celsius temperature scale

• Problem:

– No true zero point (Zero point is arbitrary)

Interval scales

Zero does not mean complete absence of temperature

Numerical data: Ratio scales

• Highest and most informative scale

• Contains the qualities of the nominal, ordinal and interval scale with the addition of an absolute zero point.

• Example:

– Amount of money

– Age

– Blood pressure

• The values can be multiplied or divided

Ratio scales

Zero on the Kelvin scale is the absolute absence of thermal energy. The Kelvin scale is therefore considered a ratio scale.
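The difference between an interval and a ratio scale can be checked with a little arithmetic. The sketch below is illustrative only; the two temperatures are arbitrary values chosen for the example.

```python
# Two temperatures measured on the Celsius (interval) scale; values are arbitrary
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale
diff_c = t2_c - t1_c

# The same temperatures on the Kelvin (ratio) scale
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
diff_k = t2_k - t1_k

# Ratios are only meaningful when zero is a true zero
ratio_c = t2_c / t1_c   # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = t2_k / t1_k   # about 1.035, the physically meaningful ratio

print(f"Difference: {diff_c:.1f} (Celsius) vs {diff_k:.1f} (Kelvin)")
print(f"Ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The difference is the same on both scales, but the ratio changes when the arbitrary zero point is replaced by a true zero.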

Numerical data

• Interval and ratio variables are sometimes indistinguishable, and are handled the same way in data analysis.

• Both can be converted to categorical data

• Converting numerical to categorical data causes loss of information
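The loss of information from categorizing numerical data can be seen in a short sketch. The heights are those of the 9-student example used in these slides; the 170 cm cutoff and the labels "short" and "tall" are hypothetical choices made only for illustration.

```python
# Heights in cm from the 9-student example used in these slides
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

def height_group(height_cm, cutoff=170):
    """Convert a numerical height to a two-level category.

    The 170 cm cutoff is a hypothetical choice made for this sketch.
    """
    return "tall" if height_cm >= cutoff else "short"

groups = [height_group(h) for h in heights_cm]

# Many distinct values collapse into two labels; the exact heights and the
# differences between them cannot be recovered from the categories.
distinct_values = len(set(heights_cm))
distinct_groups = len(set(groups))

print(groups)
print(f"{distinct_values} distinct values -> {distinct_groups} categories")
```

The conversion is one-way: a student labelled "tall" could be 170 cm or 176 cm.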

Summary of data types and scale measurement

Provides Nominal Ordinal Interval Ratio

Counts/frequency of distribution

Mode, median

The “order” of values is known

Can quantify the difference between each value

Can add or subtract values

Can multiply and divide values

Has “true zero”


ORGANIZING & DISPLAYING DATA FOR CATEGORICAL VARIABLE

Organizing & displaying data for categorical variable

• Table:

– Frequency table

• Frequency

• Relative frequency (percentage)

• Cumulative frequency (cumulative percentage)

• Graphical:

– Bar chart

– Pie chart
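A frequency table with the three columns listed above can be built as follows. This is a minimal sketch using the gender column of the 7-person example from the introduction; the layout is illustrative and is not the SPSS output format.

```python
from collections import Counter
from itertools import accumulate

# Gender column from the 7-person example in the introduction
gender = ["Male", "Male", "Female", "Male", "Female", "Female", "Female"]

counts = Counter(gender)
categories = sorted(counts)                      # alphabetical order
freq = [counts[c] for c in categories]           # frequency
total = sum(freq)
percent = [100 * f / total for f in freq]        # relative frequency (%)
cum_freq = list(accumulate(freq))                # cumulative frequency
cum_percent = [100 * cf / total for cf in cum_freq]

print(f"{'Category':<10}{'n':>4}{'%':>8}{'Cum. n':>8}{'Cum. %':>8}")
for c, f, p, cf, cp in zip(categories, freq, percent, cum_freq, cum_percent):
    print(f"{c:<10}{f:>4}{p:>8.1f}{cf:>8}{cp:>8.1f}")
```

The cumulative columns are most informative for ordinal variables, where the order of the categories is meaningful.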

Frequency table

[Figure: frequency table output from SPSS]

Bar chart

Characteristics:

1. Y axis represents frequency

2. X axis represents the categorical variable

3. Equal width of bars

4. Bars separated by equal gaps

5. Height represents frequency or percent

Pie chart

Characteristics:

1. Size of slice represents frequency or percent

2. Each slice represents a category

3. All slices together must add up to 100%

Excellent graphical presentation of data

• Accuracy: proper data entry; not misleading, distorted or susceptible to misinterpretation

• Clarity: the ideas and concepts conveyed are clearly understood

• Simplicity: straightforward; avoid gridlines or odd lettering

• Appearance: should be appealing

• Well-designed structure: patterns are highlighted, lettering is horizontal


ORGANIZING & DISPLAYING DATA FOR NUMERICAL DATA

Organizing & displaying data for numerical data

• Central tendency

• Dispersion

• Exploratory data analysis

1. Stem & leaf displays

2. Box and whisker plots

• Frequency

1. Histogram

2. Frequency polygon

3. Cumulative frequency

• Shape of distribution

Measures of central tendency

1. Mean

2. Median

3. Mode

Measures of central tendency

1. Mean

– Sample average

– Sum of all values divided by the number of values

– Sensitive to extreme values

Example:

What is the mean height of these 9 students?

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

id            1    2    3    4    5    6    7    8    9
height (cm)   167  176  174  172  170  162  171  171  169
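The mean of the nine heights can be worked out directly from the definition. The sketch below also illustrates the sensitivity to extreme values; the value 1620 is a hypothetical data-entry error introduced only for the illustration.

```python
# Heights in cm of the 9 students
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

n = len(heights_cm)
total = sum(heights_cm)
mean_height = total / n
print(f"n = {n}, sum = {total}, mean = {mean_height:.1f} cm")

# Sensitivity to extreme values: suppose 162 was mistyped as 1620
# (a hypothetical data-entry error, not part of the original data)
with_error = heights_cm.copy()
with_error[5] = 1620
mean_with_error = sum(with_error) / n
print(f"Mean with one mistyped value = {mean_with_error:.1f} cm")
```

A single wrong value is enough to move the mean far away from every real observation.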

Measures of central tendency

2. Median

– Middle value

– Not sensitive to extreme values

– Used to summarize skewed data

– When n is odd, median = [(n+1)/2]th value of the sorted data

– When n is even, median = average of the (n/2)th and [(n/2)+1]th values of the sorted data

Measures of central tendency

2. Median

Example:

What is the median height of these 9 students?

id            1    2    3    4    5    6    7    8    9
height (cm)   167  176  174  172  170  162  171  171  169

n = 9, so the median is the [(9+1)/2]th value = 5th value of the sorted data:

rank          1    2    3    4    5    6    7    8    9
sorted (cm)   162  167  169  170  171  171  172  174  176

Median = 171 cm
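The odd/even rule for the median can be written as a short function. This is a sketch for illustration; the function name is arbitrary, and the value 1620 is again a hypothetical data-entry error used to show that the median is not sensitive to extreme values.

```python
def median(values):
    """Median using the odd/even rule given in the slides."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the [(n+1)/2]th value (1-based), i.e. index (n+1)//2 - 1
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the (n/2)th and [(n/2)+1]th values (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
median_height = median(heights_cm)       # odd n: 5th sorted value
median_even = median(heights_cm[:8])     # even n: first 8 students only

# Replace 162 with a hypothetical mistyped value of 1620
with_error = heights_cm.copy()
with_error[5] = 1620
median_with_error = median(with_error)

print(median_height, median_even, median_with_error)
```

Unlike the mean, the median stays at 171 cm even when one value is grossly wrong.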

Measures of central tendency

3. Mode

– The observation that occurs most frequently

– Less useful in describing data
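As a small illustration, the mode of the nine heights and of the gender column from the introduction can be found with the Python standard library. `multimode` returns every value that shares the highest count, which also covers data with more than one mode.

```python
from statistics import multimode

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
gender = ["Male", "Male", "Female", "Male", "Female", "Female", "Female"]

# multimode returns a list of all values with the highest frequency
height_modes = multimode(heights_cm)
gender_modes = multimode(gender)

print(height_modes, gender_modes)
```

The mode is the only measure of central tendency that can be used for nominal data such as gender.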

Measures of dispersion

1. Range

2. Variance

3. Standard deviation

4. Coefficient of variation

5. Interquartile range

Measures of dispersion

1. Range

– Largest value – smallest value (max-min)

– Sensitive to extreme values

Measures of dispersion

2. Variance

– Measures the amount of spread or variability of the observations around the mean

– The sample variance (s²) is the average of the squared deviations about the sample mean (the population variance is σ²)

– Not used in descriptive statistics because a 'squared' unit of data is difficult to interpret.

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Measures of dispersion

3. Standard deviation

– Square root of variance

– Most widely used and better measure of variability

– The smaller the value, the closer the observations are to the mean

– Sensitive to extreme values

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$


Measures of dispersion

4. Coefficient of variation

Ratio of the standard deviation to the mean

Expressed as percentage

Also known as relative standard deviation

Shows the extent of variability in relation to the mean.

$$CoV = \frac{s}{\bar{X}} \times 100\%$$

Hands-on

• Calculate/find the range, variance, standard deviation and coefficient of variation for numerical variables in the given data file.

(5 minutes)

id            1    2    3    4    5    6    7    8    9
height (cm)   167  176  174  172  170  162  171  171  169
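One way to check the hands-on answers for the height variable is sketched below. It follows the sample formulas (divisor n − 1) and is not the SPSS procedure intended for the exercise.

```python
from math import sqrt

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

n = len(heights_cm)
mean_height = sum(heights_cm) / n

# Range: largest value minus smallest value
value_range = max(heights_cm) - min(heights_cm)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
variance = sum((x - mean_height) ** 2 for x in heights_cm) / (n - 1)

# Standard deviation: square root of the variance
sd = sqrt(variance)

# Coefficient of variation: SD relative to the mean, as a percentage
cov_percent = 100 * sd / mean_height

print(f"Range = {value_range} cm")
print(f"Variance = {variance:.2f} cm^2")
print(f"SD = {sd:.2f} cm")
print(f"CoV = {cov_percent:.2f}%")
```

Note that the variance is in cm², whereas the range and standard deviation are in cm and the coefficient of variation has no unit.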

Measures of dispersion

5. Interquartile range (IQR):

Data can be divided into quarters (four equal parts):

Q1 = 25th percentile

Q2 = 50th percentile

Q3 = 75th percentile

IQR is the distance from Q1 to Q3

[Figure: data from Min to Max divided into quarters]

Measures of dispersion

5. Interquartile range:

The most common interpercentile measure

Not sensitive to extreme values (outliers)

Usually reported together with the median for skewed distributions
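Quartiles can be computed in more than one way, and software packages do not all use the same convention. The sketch below uses the two methods offered by Python's `statistics.quantiles` on the nine heights; neither is claimed to be the method used in the SPSS output of these slides.

```python
from statistics import quantiles

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

# "inclusive": treats the data as containing the minimum and maximum possible values
q1_inc, q2_inc, q3_inc = quantiles(heights_cm, n=4, method="inclusive")
iqr_inc = q3_inc - q1_inc

# "exclusive": allows for more extreme values than those in the sample
q1_exc, q2_exc, q3_exc = quantiles(heights_cm, n=4, method="exclusive")
iqr_exc = q3_exc - q1_exc

print(f"inclusive: Q1={q1_inc}, Q2={q2_inc}, Q3={q3_inc}, IQR={iqr_inc}")
print(f"exclusive: Q1={q1_exc}, Q2={q2_exc}, Q3={q3_exc}, IQR={iqr_exc}")
```

Both methods agree on the median (Q2), but Q1, Q3 and therefore the IQR differ for this small sample, so it is worth stating which method was used when reporting an IQR.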


GRAPHICAL VISUALIZATION/ PRESENTATION FOR NUMERICAL DATA

Exploratory data analysis

1. Stem & leaf displays

2. Box and whisker plots

Exploratory data analysis

Stem & leaf displays

Allows easier identification of individual values in the sample

height Stem-and-Leaf Plot

Frequency    Stem &  Leaf

1.00    Extremes    (=<162)
2.00          16 .  79
5.00          17 .  01124
1.00          17 .  6

Stem width: 10
Each leaf: 1 case(s)

id            1    2    3    4    5    6    7    8    9
height (cm)   167  176  174  172  170  162  171  171  169
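A basic stem-and-leaf display can be produced in a few lines. This sketch uses one stem per ten centimetres; unlike the SPSS output above, it does not split stems or list extreme values separately.

```python
from collections import defaultdict

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

# Stem = tens (and hundreds), leaf = last digit; stem width is 10
stems = defaultdict(list)
for value in sorted(heights_cm):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

lines = [
    f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
```

Every original value can be read back from the display, for example stem 16 with leaf 7 is 167 cm.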

Exploratory data analysis

Box and whisker plots

• Graphical display of percentiles

• Also known as the 5-number summary plot (min, Q1, Q2, Q3, max)

• Provides information on the central tendency and variability of the middle 50% of the distribution

• Box represents the 25th to 75th percentile

Exploratory data analysis

Box and whisker plots

• Observations more than 1.5 × IQR away from the edge of the box are outliers

• Observations more than 3 × IQR away from the edge of the box are extreme outliers

• Whiskers extend to the smallest and largest values that are not outliers

• Continuous data in multiple groups can be displayed side by side
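The outlier rules for box and whisker plots can be applied numerically. The sketch below uses the nine heights and the "inclusive" quartile method of Python's `statistics.quantiles`, which is an assumption; other quartile conventions can give different fences for small samples.

```python
from statistics import quantiles

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

# Quartiles (assumption: "inclusive" method; conventions differ between packages)
q1, q2, q3 = quantiles(heights_cm, n=4, method="inclusive")
iqr = q3 - q1

# Fences measured from the edges of the box
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_extreme, upper_extreme = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in heights_cm if x < lower_fence or x > upper_fence]
extreme_outliers = [x for x in heights_cm if x < lower_extreme or x > upper_extreme]

# Whiskers end at the smallest and largest values that are not outliers
inside = [x for x in heights_cm if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(f"Box: {q1} to {q3}, median {q2}, IQR {iqr}")
print(f"Outliers: {outliers}, extreme outliers: {extreme_outliers}")
print(f"Whiskers: {whiskers}")
```

With this convention the height of 162 cm is flagged as an outlier, which agrees with the stem-and-leaf output listing it under "Extremes".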


Measures of frequency of distribution: Graphs

1. Histogram

2. Frequency polygon

3. Cumulative frequency

Measures of frequency of distribution: Graphs

Histogram

Graphical representation of the frequency distribution of a variable.

Bar height represents frequency or percent

Bar width represents the class interval

No gaps between the class intervals

Gives an idea of the shape of the distribution: normal or skewed


Measures of frequency of distribution: Graphs

Frequency polygon

• A graph that displays the data using lines to connect points plotted for the frequencies

• The plotted frequencies correspond to the heights of the vertical bars in the histogram


Measures of frequency of distribution: Graphs

Cumulative frequency

Used to determine the number of observations that lie below or above a particular value

Calculated using a frequency distribution table

Can be constructed from stem and leaf plots or directly from data
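The frequency distribution behind a histogram, and the cumulative frequency derived from it, can be sketched as follows. The class width of 5 cm and the starting point of 160 cm are hypothetical choices for the nine heights, not values taken from the slides.

```python
from itertools import accumulate

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

# Class intervals [lower, upper); width and start are hypothetical choices
class_width = 5
classes = [(lower, lower + class_width) for lower in range(160, 180, class_width)]

# Frequency per class (these would be the bar heights of the histogram)
freq = [sum(lower <= x < upper for x in heights_cm) for lower, upper in classes]

# Cumulative frequency: running total of the class frequencies
cum_freq = list(accumulate(freq))

for (lower, upper), f, cf in zip(classes, freq, cum_freq):
    print(f"{lower}-{upper - 1} cm  {'#' * f:<6} n={f}  cumulative={cf}")
```

The cumulative column answers questions such as "how many students are shorter than 175 cm?" directly: it is the cumulative frequency of the 170-174 cm class.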


Measures of shape of distribution

• Skewness

• Kurtosis

Measures of shape of distribution

• Skewness:

– Measure of asymmetry of a distribution around its mean.

– Graphically examined by plotting a normal curve on the histogram

– Negative skewness: left tail is more pronounced than the right tail

– Positive skewness: right tail is more prominent than the left tail.


Measures of shape of distribution

• Kurtosis:

– Relative peakedness or flatness of a distribution compared with the normal distribution.

– Visualised by plotting a normal curve on the histogram

– Types:

– Distribution with a high peak: leptokurtic

– Distribution with a flat-topped curve: platykurtic

– Normal distribution: mesokurtic
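Skewness and kurtosis can also be expressed as numbers. The sketch below uses the simple moment-based coefficients, which is an assumption made for illustration; statistical packages may apply small-sample adjustments, so their output can differ from these values.

```python
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

n = len(heights_cm)
mean_height = sum(heights_cm) / n

def central_moment(values, mean_value, order):
    """Average of the deviations from the mean raised to the given power."""
    return sum((x - mean_value) ** order for x in values) / len(values)

m2 = central_moment(heights_cm, mean_height, 2)
m3 = central_moment(heights_cm, mean_height, 3)
m4 = central_moment(heights_cm, mean_height, 4)

# Moment-based skewness: negative = longer left tail, positive = longer right tail
skewness = m3 / m2 ** 1.5

# Excess kurtosis: about 0 for a normal (mesokurtic) distribution,
# positive for leptokurtic, negative for platykurtic
excess_kurtosis = m4 / m2 ** 2 - 3

print(f"Skewness = {skewness:.2f}")
print(f"Excess kurtosis = {excess_kurtosis:.2f}")
```

For the nine heights the skewness is negative, reflecting the single low value of 162 cm in the left tail.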


HOW TO PRESENT

General rule

• Can be presented in graphical, table or text format

• Categorical variable:

– n (%)

• Numerical variable:

– Symmetric data: mean (standard deviation)

– Skewed data: median (IQR)

How to decide symmetric or skewed?

• Statistical

– Mean = median = mode

– Skewness

– Kurtosis

– Kolmogorov-Smirnov test (p > 0.05)

– Shapiro-Wilk test (p > 0.05)

How to decide symmetric or skewed?

• Graphical

– Histogram

– Stem and Leaf plot

– Box and whisker plot

Table presentation

Table 1: Characteristics of study participants (n=30)

Variable          Mean (SD)    n (%)
Age (yrs)
Sex
  Female
  Male
Race
  Malay
  Chinese
  Indian
Education
  Primary
  Secondary
  Tertiary
BMI (kg/m2)
DBP (mmHg)
SBP (mmHg)

*median (IQR)


THANK YOU.