Introduction to Data Handling
A Fast Hour
Review of data types: scalar, ordinal, nominal
Decisions regarding encoding data: turning information into analyzable data; dealing with missing data
The structure of experimental data: getting things into 2-dimensional (or a-few-dimensional) tables
Deciding which software to use: Excel; spreadsheet-style analysis packages; scripted analysis
Review of Data Types
Scalar (continuous or discrete)
Ordinal
Nominal
Scalar Data
Continuous data: real numbers used to measure magnitude; unbounded in at least one direction. Ex: average Dilantin level
(3.1 + 4.4) / 2 = 3.75
Discrete data: data that can take on only a countable set of values; unbounded in at least one direction. Ex: average number of fingers
(5 + 4) / 2 = 4.5, which is not a possible count but is 'in between' 4 and 5
Scalar Data
Truly continuous data are theoretical – you don't run into them in the real world
Because of limitations of measurement (e.g., significant figures), scalar data are actually discrete
In most real-life applications, discrete data can be handled as if continuous; just beware of the '2.3 kids' problem
Ordinal Data
Data whose attributes are ordered, but for which the numerical differences between adjacent attributes are not necessarily interpreted as equal
Bounded: the scale has some upper and lower limit
Classic example: the Glasgow Coma Scale
A GCS of 4 intuitively ranks lower than a GCS of 5
The difference between GCS 14 and GCS 15 is not the same as the difference between GCS 3 and GCS 4
GCS 4 + GCS 5 ≠ GCS 9
Nominal Data
May have an assigned numerical value for analytical reasons, but there is no numerical underpinning for the variables
Example: race
African American = 1
Hispanic = 2
Asian = 3
1 + 2 ≠ 3
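For illustration, a minimal R sketch of declaring these codings so that software treats the numbers as labels rather than magnitudes (the variable names here are hypothetical):

  # Nominal: the race codes are labels, not quantities
  race <- factor(c(1, 2, 3, 1), labels = c("African American", "Hispanic", "Asian"))

  # Ordinal: GCS values are ordered, but not equal-interval
  gcs <- factor(c(3, 4, 14, 15), levels = 3:15, ordered = TRUE)

  mean(race)  # NA with a warning: the mean of a nominal variable is meaningless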
Turning information into analyzable data

Discrete data are usually easy:
Age
Vital signs
One-dimensional measures (e.g., Hgb, time-to-relapse)

Ordinal and nominal data get tricky:
If you're only going to do descriptive statistics, it doesn't matter much
If you're going to model (e.g., do regression), it gets involved
Real Life Example from the Camp Survey

Question 3. On a usual camp day, the person on site with the highest level of health care training is a:
Physician
Registered nurse
Licensed practical nurse
Licensed paramedic
Licensed EMT
Licensed first responder
First aid provider
Real Life Example from the Camp Survey
What type of variable would you use?
Real Life Example from the Camp Survey

One choice: a continuous variable
"On a usual camp day, how many years of training has the senior-most caregiver completed?"
Var_Years
Real Life Example from the Camp Survey

Another, more likely choice: an ordinal variable
Physician = 1
RN = 2
LPN = 3
Paramedic = 4
EMT = 5
First responder = 6
First aid = 7
Var_Caregiver
Real Life Example from the Camp Survey

A third choice: seven nominal 'dummy variables'
Var_MD = 1 or 0 (yes or no)
Var_RN = 1 or 0
Var_LPN = 1 or 0
Var_Para = 1 or 0
Var_EMT = 1 or 0
Var_Respond = 1 or 0
Var_FirstAid = 1 or 0
Real Life Example from the Camp Survey

Who cares?

Var_Caregiver  Var_MD  Var_RN  Var_LPN  Var_Para  Var_EMT  Var_Respond  Var_FirstAid
1              1       0       0        0         0        0            0
2              0       1       0        0         0        0            0
3              0       0       1        0         0        0            0
4              0       0       0        1         0        0            0
5              0       0       0        0         1        0            0
6              0       0       0        0         0        1            0
7              0       0       0        0         0        0            1

The seven dummy variables together encode the same information as Var_Caregiver; Var_Years, by contrast, simply takes real-number values.
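A minimal R sketch of generating this dummy coding automatically (model.matrix is base R; the variable name is hypothetical):

  caregiver <- factor(1:7, labels = c("MD", "RN", "LPN", "Para", "EMT", "Respond", "FirstAid"))

  # '- 1' drops the intercept so each level gets its own 0/1 indicator column,
  # reproducing the table above
  model.matrix(~ caregiver - 1)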
A Basic Modeling Problem
Is there a relationship between the level of on-site caregiver training and the number of deaths per year at camp?
Deaths = f (Caregiver Level)
[Figure: Number of Deaths plotted against Var_Caregiver (1-7)]
Deaths = b1x1 + b0
where x1 = Var_Caregiver (1-7), b1 = a coefficient, and b0 = the y-intercept
[Figure: Number of Deaths plotted against the seven dummy variables, Var_MD through Var_First_Aid]
Deaths = b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + b7x7 + b0
where x1 = Var_MD, x2 = Var_RN, etc.; b1 through b7 are coefficients for each x; b0 = the y-intercept
[Figures, side by side: Number of Deaths vs. Var_Caregiver (1-7), and Number of Deaths vs. the seven dummy variables Var_MD through Var_First_Aid]
The ordinal model (Var_Caregiver)
Pros: easy to compute; easy to understand
Cons: forces a 'continuous' structure onto Var_Caregiver that may not really exist

The dummy-variable model
Pros: agrees more closely with experimental results; doesn't impose any relationship between different provider levels
Cons: less easy to understand; 'discards' the knowledge that some caregivers have more training than others
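Both models are one-liners in a scripted package. A hedged R sketch; the data frame and column names (camp, deaths, caregiver) are hypothetical:

  # Ordinal-as-continuous: a single slope imposes equal spacing on levels 1-7
  fit1 <- lm(deaths ~ caregiver, data = camp)

  # Dummy-variable model: factor() makes R expand caregiver into 0/1 indicators
  fit2 <- lm(deaths ~ factor(caregiver), data = camp)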
Decisions regarding how to encode data

Stick as close to the 'raw measurement' as you can
Stick close to the original measurement

"When you call an ambulance in an emergency, how long does it take for the ambulance to get to your camp?"
< 5 minutes (Time = 1)
5-10 minutes (Time = 2)
10-15 minutes (Time = 3)
15-20 minutes (Time = 4)
> 20 minutes (Time = 5)
Don't know (Time = 6)

Good, bad, indifferent?
Stick close to the original measurement

"Do you know how long it would take an ambulance to respond to a call from your camp?" (y/n)
"If so, how many minutes?" (some discrete #)
Decisions regarding how to encode data

Stick as close to the 'raw measurement' as you can
Abstraction seems useful, but it distances you from what you were originally looking at
Keep continuous data continuous if at all possible; likewise, preserve ordinal and nominal data
Later on, you can 'digest' the raw data into categories, etc., as necessary (see the sketch below)
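A minimal R sketch of 'digesting' raw minutes into the survey's categories only when the analysis calls for it; response_min is a hypothetical vector of raw response times:

  response_min <- c(3, 7, 12, 18, 25)  # stays close to the raw measurement

  # Collapse into bins later, during analysis
  time_cat <- cut(response_min,
                  breaks = c(0, 5, 10, 15, 20, Inf),
                  labels = c("<5", "5-10", "10-15", "15-20", ">20"))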
Decisions regarding how to encode data

Remember: data can always be made more general during analysis. They cannot be made more specific.
Decisions regarding how to encode data

Stick as close to the 'raw measurement' as you can
Avoid bundling more than one idea into a single variable
Ex: <5, 5-10, 10-15, 15-20, >20, 'Don't know' – the last choice mixes response time with the respondent's knowledge of it
Decisions regarding how to encode data

Stick as close to the 'raw measurement' as you can
Avoid bundling more than one idea into a single variable
Use a specific plan for missing data!
Missing Data
Blank data cells are ambiguous:
Data not provided/collected?
Data erroneously omitted?
Data provided but nonsensical?

Note: many statistical packages will ignore an entire 'observation' if a data point is missing!
Missing Data
Pick something (other than nothing) to denote a missing data point; '.' or 'Null' are commonly used
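For instance, a sketch of telling R at import time which markers mean 'missing' (the file name is hypothetical):

  # Treat '.', 'Null', and empty cells as NA rather than as text
  d <- read.csv("camp_survey.csv", na.strings = c(".", "Null", ""))

  colSums(is.na(d))  # then count the missing points per field, explicitly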
The Structure of Data
Statistical analysis is based on the idea of 'observations'
An observation often is a patient (and all of the data you collect about that patient)
Really, it is just an experimental 'unit' or 'trial,' such as one summer camp or one hospital day
Any analysis of many observations requires that you establish a 'structure' for your observations
The Structure of Data
You'll need to think about the 'shape' of your experimental data early in your study, preferably during planning
Fortunately, very many data sets can be structured into a tabular form
For better or worse, Excel is used really often
The Structure of Data
Obs #   Last Name   Systolic BP   Diastolic BP
1       Fawcett     114           54
2       Smith       93            42
3       Jackson     78            49
4       Ladd        58            38

Columns are fields; rows are observations.
Don’t confuse a 2-dimensional data table with 2-dimensional data!
Ultimately, every observation is a mathematical ‘vector’ that completely describes that event in an n-dimensional space
[Figure: the four observations (Fawcett, Smith, Jackson, Ladd) plotted as points in SBP-DBP space]
Your data have as many dimensions as they have data fields!
(Unavoidable) Shortcomings of Tabular Data

Large number of fields or observations: difficult to 'look' at all of the data
Troubles with repeated measures
Handling Repeated Measure Data in a Tabular Data Structure
Wide format (one row per patient):

Patient ID | Weight Day 1 | BUN Day 1 | Urine Day 1 | Weight Day 2 | BUN Day 2 | Urine Day 2 | ... | Weight Day 5 | BUN Day 5 | Urine Day 5

Long format (one row per patient-day):
Obs # Last Name Hospital Day Systolic BP
1 Fawcett 1 84
2 Fawcett 2 72
3 Fawcett 3 84
4 Smith 1 94
Handling Repeated Measures in Tabular Data Structures
The 'Day in the Life' strategy: a patient-day becomes the observation
Can be a more compact way of saving data (a reshaping sketch follows below)
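A minimal sketch of going from wide to long in R, assuming the tidyr package and hypothetical column names (PatientID, Weight_Day1, BUN_Day1, ...):

  library(tidyr)

  # Wide: one row per patient, columns like Weight_Day1, BUN_Day1, Urine_Day1, ...
  # Long: one row per patient-day, with Weight, BUN, and Urine columns
  long <- pivot_longer(wide,
                       cols = -PatientID,
                       names_to = c(".value", "Day"),
                       names_sep = "_Day")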
Using Relational Databases for More Complex Data

[Figure: linked tables – Demographic Data; Daily Data (for each of 7 study days); Bacterial Isolate Data; Outcome Data]
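Even without a full database manager, R can join such linked tables on a key. A sketch; the table and column names are hypothetical:

  # 'demo' has one row per patient; 'daily' has one row per patient-day.
  # merge() repeats each patient's demographics across that patient's days.
  combined <- merge(demo, daily, by = "PatientID")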
Deciding Which Software to Use

Some useful ground rules:
1. Use software with all of the tools you need
2. Don’t make things unnecessarily complicated
3. Know in advance what your statistical collaborators are going to use, and how they like the data to appear
Deciding Which Software to Use

Data-entry-level tools:
Input methods other than just entering fields in an Excel spreadsheet
'Forms'-type pages
Interfaces with other data types
Interface with Scantron
Interface with analytical instruments
Deciding Which Software to Use

Data-entry-level tools:
Entry error control
Double entry
Restricted data fields that must fit a particular format or be rejected
Merging data sets
Doing this by hand is fine for 15 patients, but not for 1,500
Deciding Which Software to Use

Data manipulation needs: do your data need some post-collection modification prior to analysis? (A sketch of a few such steps follows below.)
Transformation (e.g., log-transforming to achieve normal statistical distributions)
Relabeling missing data fields
Text or numerical string modification (e.g., changing all dates to MM/DD/YYYY)
Internal data consistency checks (e.g., is the number of ICU days ≤ the number of hospital days?)
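A hedged R sketch of such post-collection steps; the data frame d and its columns (crp, admit_date, icu_days, hosp_days) are hypothetical:

  # Log-transform a right-skewed lab value toward normality
  d$log_crp <- log(d$crp)

  # Standardize dates to MM/DD/YYYY (assumes they arrived as DD-MM-YYYY)
  d$admit_date <- format(as.Date(d$admit_date, format = "%d-%m-%Y"), "%m/%d/%Y")

  # Internal consistency check: ICU days can never exceed hospital days
  stopifnot(all(d$icu_days <= d$hosp_days, na.rm = TRUE))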
Deciding Which Software to Use

What analyses are you going to perform?
Easy in Excel: summary statistics (frequencies, means, etc.); simple x-by-y regressions
Not easy in Excel: contingency tables (and χ²); ANOVA
Best handled in dedicated stats packages or elsewhere: multivariate modeling; logistic modeling; nonlinear modeling (see the sketch below)
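The analyses that are painful in Excel are one-liners in a stats package. A hedged R sketch with a hypothetical data frame d (columns camp_type, caregiver, deaths, died):

  chisq.test(table(d$camp_type, d$caregiver))          # contingency table and chi-squared
  summary(aov(deaths ~ camp_type, data = d))           # one-way ANOVA
  glm(died ~ caregiver, family = binomial, data = d)   # logistic model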
Deciding Which Software to Use

Output needs:
Tabular data that can be dumped into a word processor (text files; cut-and-paste)
Graph preparation and export
Cut-and-paste
Specialized output formats: .tif, .jpg, .svg, MS metafiles
Colors (RGB vs. CMYK)
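As one illustration, base R can write a print-resolution graphic directly; the file name, dimensions, and the data frame d are hypothetical:

  # Open a 300-dpi TIFF device, draw, then close to write the file
  tiff("figure1.tif", width = 5, height = 4, units = "in", res = 300)
  plot(d$caregiver, d$deaths)
  dev.off()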
Deciding Which Software to Use

Other needs you might not have thought about, but that are really important:
Interim 'noodling'-type analysis
Needing to repeat the analysis on multiple data sets, or to 'update' the analysis if new data become available (see the sketch below)
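A minimal sketch of scripted re-analysis: the identical analysis applied to every data set in a directory. The file pattern and model are placeholders:

  # Re-run the same analysis whenever new camp-year files appear
  for (f in list.files(pattern = "^camp_.*\\.csv$")) {
    d <- read.csv(f)
    print(summary(lm(deaths ~ caregiver, data = d)))
  }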
Deciding Which Software to Use
Spreadsheets: Excel
Spreadsheet and 'pull-down' stats packages: SPSS, Prism (GraphPad), JMP
Database managers: Access, FoxPro
Scripted statistical languages: SAS, R, MATLAB
[Arrow spanning the list above: increasing level of organization and increasing front-end time, moving from spreadsheets toward scripted languages]
Handling Your Data in Excel
Few up-front requirements: load your data and you're ready to go
Many simple stats can be done as 'one-off' analyses
VERY inflexible: you pay for your choice later on in debugging, rerunning analyses, editing the data set, etc.
Using Spreadsheet/'Pull-Down' Stats Packages

The most power most users will ever need
Slightly more up-front time; forced data structures are like eating oatmeal
Most have integrated graphics utilities
Some unusual applications are tough to manage (e.g., nonlinear analysis)
Using Scripted Statistical Packages

When you anticipate running relatively complicated analyses on a series of data sets
When you can design the analysis plan without having all of the data available
When you must document exactly how you did your analysis and be able to duplicate it exactly at will – which is arguably every time (!)
  # An example R script: read a data matrix, flip its sign, and draw a heat map
  g <- read.csv("expdata2.csv", header = TRUE)
  gmat <- as.matrix(g)
  gmati <- gmat * -1
  heatplot(gmati, Colv = NA)  # heatplot() is not base R; it comes from an add-on package
Back-End Utilities
Graphical output: Excel has horrible graphics that can be spotted a mile away in journals; most stats packages will do better
Consider 'post-processing' in dedicated graphics software, e.g., Adobe Illustrator
Research is a Data Business, Use the Tools at Your Disposal
[Figure: workflow from Data Input System → Dedicated Database Manager → Statistical Package(s) → Graphing System → Graphics Polishing for Publication]
Other Very Important Resources

Google: almost everything you need to know, and most of it's pretty accurate
Java applets: many stats applications can be found online that will run on any machine
Open-source code is on its way: R, Linux
CSCAR: sometimes more helpful than others
Who Will Not Be Helpful
MCIT
Questions?

People and their Software

Sue Stern: JMP (the SAS 'pull-down' package); repeated-measures analysis of clinical data
Bonnie Singal: SAS; pretty much any clinical statistical research question
Matt Trowbridge: stats and GIS packages; merging complex data sets
John Younger: SAS, Prism, R; kinetics, logistic and nonlinear models of complex behaviors