8/12/2019 Analysis Complex Samples 131108
http://slidepdf.com/reader/full/analysis-complex-samples-131108 1/31
Using R to analyse
complex survey samples
Thomas Lumley
Associate Professor of Biostatistics,
University of Washington.
R Core Development Team
Outline
• The R survey package
• Why has R become successful?
• Why open-source software matters to statistics
R (needs no introduction)
• An open-source reimplementation of the S language
from Bell Labs
• Initially a Kiwi creation, now used around the world
– 2008 Pickering Medal to Ross Ihaka for R
• Probably the most popular medium for distributing
new statistical methodology
– CRAN: 1500 packages
– Bioconductor: 500 packages
http://faculty.washington.edu/tlumley/survey/
Brief history
• 2002: I visit Auckland, start writing survey package
• January 2003: first version released
• July 2003: replicate weights
• April 2004: published in J. Stat. Software
• (US) Spring 2005: multistage sampling, calibration
• (US) Winter 2006: two-phase designs
• (NZ) Winter 2008: database-backed designs
• August 2009: book (I hope)
Design philosophy
Mostly comes from limited resources
– write in high-level language
– code reuse to expose bugs
– keep data in memory (mostly)
– don’t optimize until someone complains (Moore’s Law)
– emphasize features that look like biostatistics
Package is about 8000 lines of code
– cf 250,000 for VPLX from US Census Bureau
– about 300,000 for all of R; 25,000,000 for SAS (!)
Interesting features
• Secondary analysis/modelling of large surveys
– graphics, smoothing
– regression models
– analysis of multiply-imputed data
• Simulations
– R programming language
• Calibration (raking, GREG) estimators
– including calibration for regression models
• Database-backed objects
– data loaded as needed from relational database
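As a rough illustration of the calibration bullet, here is a minimal sketch using the California API example data that ships with the survey package. The population counts are the ones quoted in the package documentation for this data set. This is one possible way to calibrate, not the only one.

```r
library(survey)
data(api)  # example data sets shipped with the survey package

# One-stage cluster sample of school districts
dclus1 <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Known population totals for the columns of the model matrix of ~stype:
# 6194 schools in total, of which 755 are high schools and 1018 middle schools
pop.totals <- c(`(Intercept)` = 6194, stypeH = 755, stypeM = 1018)

# GREG (linear) calibration returns a new design object with adjusted weights
dclus1g <- calibrate(dclus1, ~stype, pop.totals)

# Estimated school-type totals now reproduce the known population counts
cal_totals <- svytotal(~stype, dclus1g)
print(cal_totals)

# Compare an outcome estimate before and after calibration
print(svymean(~api00, dclus1))
print(svymean(~api00, dclus1g))
```

The original `dclus1` object is left unchanged, so calibrated and uncalibrated estimates can be compared side by side.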
Why me?
[ie: Lumley? What does he know about surveys?]
Semiparametric model-based methods are converging on design-based inference
– ‘sandwich’ variance estimators
– model-robustness
– concept of parameters as functionals on distributions
– IPW in causal inference, missing data
– two-phase sampling in cohort studies
Emphases are different: that’s what users are for.
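To make the IPW bullet concrete, here is a toy sketch in base R. The four observations and their inclusion probabilities are invented for illustration. It shows that the inverse-probability-weighted mean used in missing-data problems is the same arithmetic as the weighted estimators of survey sampling.

```r
# Invented toy sample: observed values and assumed-known inclusion probabilities
y    <- c(10, 20, 30, 40)
prob <- c(0.5, 0.5, 0.25, 0.25)

# Sampling weights are inverse probabilities
w <- 1 / prob

# Horvitz-Thompson estimate of the population total
ht_total <- sum(w * y)

# Ratio (Hajek) estimate of the population mean: an IPW mean
ipw_mean <- sum(w * y) / sum(w)

print(ht_total)
print(ipw_mean)
print(weighted.mean(y, w))  # same value as ipw_mean
```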
User interface
• Data and design meta-data are stored in a survey
design object
– ensures meta-data and data are kept together
– subset operator sets up data for domain estimation
– post-stratification/calibration creates new object
• Data variables in the object are specified by model
formulas
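A minimal sketch of these interface ideas, using the `apistrat` example data shipped with the survey package (chosen here because it is freely available, not taken from the slides):

```r
library(survey)
data(api)  # example California school data shipped with survey

# The design object bundles the data with its design meta-data:
# strata, sampling weights and finite population sizes
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Variables are referred to with one-sided model formulas
print(svymean(~api00 + api99, dstrat))

# subset() returns another design object, so the domain estimate
# is computed with the design information still attached
elementary <- subset(dstrat, stype == "E")
elem_mean <- svymean(~api00, elementary)
print(elem_mean)
```

Because the meta-data travel with the object, later commands need only the formula and the design.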
Example: NHANES III
dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
                    weights=~WTPFQX6, data=nhanes3, nest=TRUE)
svymean(~BMPWTMI+BMPHTMI, design=dhanes)
svyquantile(~BMPWTMI, design=dhanes, quantiles=0.5)
svytotal(~factor(HAB1MI), design=dhanes)
adults <- subset(dhanes, HSAGEIR>18)
adults <- update(adults,
                 bmi=BMPWTMI/(BMPHTMI/100)^2)
adults <- update(adults, bmigp=cut(bmi, c(0,18.5,25,30,Inf)))
svymean(~bmigp, adults)
svyby(~bmigp, ~HAB1MI, svymean, design=adults)
Example: Californian schools
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2,
                    data=apiclus2)
model1 <- svyglm(api00~api99+emer, design=dclus2)
model2 <- svyglm(api00~api99+meals+mobility+ell+
                 emer, design=dclus2)
model3 <- svyglm(api00~api99+stype+emer,
                 design=dclus2)
summary(model1)
summary(model2)
summary(model3)
Large data
• With all data kept in memory
– on a laptop, NHANES-scale analyses feasible if relevant variables selected first
– inexpensive 64-bit Linux systems can handle millions
of records
• Database-backed
– variables loaded on-demand for each command
– hundreds of thousands of records possible on laptop
• 2007 BRFSS: 430,000 records (is just feasible)
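A sketch of "select relevant variables first", using the small `apiclus2` example data from the survey package as a stand-in for a large survey file. The `keep` vector is illustrative. The point is only that the design object stores whatever data frame it is given.

```r
library(survey)
data(api)

# Keep only the design variables and the variables needed for the analysis
keep <- c("dnum", "snum", "fpc1", "fpc2", "api00", "api99", "emer")

dsmall <- svydesign(ids = ~dnum + snum, fpc = ~fpc1 + fpc2,
                    data = apiclus2[, keep])
dfull  <- svydesign(ids = ~dnum + snum, fpc = ~fpc1 + fpc2,
                    data = apiclus2)

# The trimmed design object takes less memory
print(object.size(dsmall), units = "Kb")
print(object.size(dfull), units = "Kb")

# Estimates for the retained variables are unchanged
m_small <- svymean(~api00, dsmall)
m_full  <- svymean(~api00, dfull)
print(m_small)
```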
Database-backed
dhanes <- svydesign(id=~SDPPSU6,
                    strata=~SDPSTRA6, weights=~WTPFQX6,
                    data="set1", dbtype="ODBC",
                    dbname="nhanes3", nest=TRUE)
• Specify a SQL database table or view as the data
source. Only read access is needed
• Design metadata is kept in memory, other
variables loaded only as needed
• Works with ODBC, JDBC, and directly with
Oracle, PostgreSQL and other popular databases
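For readers without an ODBC source to hand, the same idea can be tried with SQLite. This sketch assumes the RSQLite package is installed and uses the small `api.db` example database that the survey package documentation refers to. It is an illustration of the interface, not a large-data benchmark.

```r
library(survey)
library(RSQLite)  # assumed to be installed; provides the SQLite driver

data(api)
db_path <- system.file("api.db", package = "survey")

# data= names a table in the database rather than a data frame in memory
dbclus1 <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc,
                     data = "apiclus1", dbtype = "SQLite", dbname = db_path)

# Analysis variables are read from the database when a command needs them
m_db <- svymean(~api00, dbclus1)
print(m_db)

# The in-memory design gives the same answer
dclus1 <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
m_mem <- svymean(~api00, dclus1)
print(m_mem)

close(dbclus1)  # release the database connection
```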
Data from NHIS: about 25k observations
Health insurance coverage (by age)
Why is R successful?
Charlton Heston brings SAS down from Mt Sinai
R spreads through a terrified nation
1998
Reasons for the R pandemic?
• Extensibility
• Cost
• Rapid development
• Network effects
Extensibility
• Can users write extensions that look like built-in
functionality?
• Can users find these extensions?
• Is it easy to tell what extensions are installed and
to get rid of them?
• Can old versions of the software co-exist with new
ones?
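As an illustration of the first question, here is a small base-R sketch. The `trimmed_mean` class and its methods are hypothetical names invented for this example. It shows that a user-written object can respond to the same generic functions (`print`, `coef`) as built-in model objects.

```r
# Hypothetical user-defined estimator that returns a classed object
trimmed_mean <- function(x, trim = 0.1) {
  structure(list(estimate = mean(x, trim = trim), trim = trim, n = length(x)),
            class = "trimmed_mean")
}

# S3 methods: the built-in generics dispatch on the new class
print.trimmed_mean <- function(x, ...) {
  cat("Trimmed mean (trim = ", x$trim, ", n = ", x$n, "): ",
      format(x$estimate), "\n", sep = "")
  invisible(x)
}
coef.trimmed_mean <- function(object, ...) object$estimate

tm <- trimmed_mean(c(1, 2, 3, 4, 100), trim = 0.2)
print(tm)   # uses print.trimmed_mean
coef(tm)    # uses coef.trimmed_mean; the trimmed mean of 2, 3, 4 is 3
```

The other questions on the slide are handled at the package level, for example with `installed.packages()` and `remove.packages()`.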
Free (as in beer)
• Price sensitivity should be lower for specialist
statisticians, and for large companies where
statistics is mission-critical
– but these are more likely to use R
• Students are price-sensitive
– low cost is useful in teaching
– academics learn computing from their PhD students
Rapid development
• User’s syntax is the same as developer’s language
– deliberate design for ‘slippery slope’ to programming
• Functional language, dynamic types
– slow, inefficient memory use
– lack of side effects makes it very easy to use
– most of R is written in R
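A small base-R sketch of the "lack of side effects" point. The data frame and the `add_bmi` function are invented for illustration. A function that modifies its argument works on a copy, so the caller's object is untouched unless the result is assigned.

```r
# Invented example data
people <- data.frame(weight_kg = c(70, 90), height_cm = c(175, 180))

# The function adds a column to its argument and returns the result
add_bmi <- function(d) {
  d$bmi <- d$weight_kg / (d$height_cm / 100)^2
  d
}

with_bmi <- add_bmi(people)

names(people)    # still only weight_kg and height_cm
names(with_bmi)  # has the extra bmi column

# The same expression works interactively or inside a function:
# the user's syntax is the developer's language
round(with_bmi$bmi, 1)
```

The copying is part of what the slide means by slow, inefficient memory use.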
Why is open-source statistical
software important?
Open-source
• Three related benefits
– publication of novel methods
– dissemination of good statistical practice
– reproducibility
• Open-source platform is not required, but it helps
– need widely available platform
– need packaging system for distributable code
– need archive of old platform versions
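One modest way to support the reproducibility bullet in practice is to record the random seed and the platform and package versions alongside an analysis. The sketch below uses only base R, and the seed value is arbitrary.

```r
# Fix the random number seed so simulated results can be regenerated
set.seed(20081113)  # arbitrary value chosen for this example
x <- rnorm(5)

# Re-running with the same seed gives identical draws
set.seed(20081113)
x_again <- rnorm(5)
identical(x, x_again)

# Record which platform and package versions produced the results
R.version.string
packageVersion("stats")
info <- sessionInfo()
print(info)
```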
Code as language
• Code describes exactly what analyses you did
– equations miss many practical aspects
– complete and precise descriptions in English are hard,
and more ugly than the code
• Code can be reused
– a problem should not need to be solved more than once
• Tools for communicating with others also help
when communicating with yourself