Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
Reproducible Research with Evidence-‐based Data Analysis
Roger D. Peng, PhD Department of Biosta/s/cs
Johns Hopkins Bloomberg School of Public Health @rdpeng, @simplystats, simplystatistics.org!
!
April 2014
ReplicaAon, ReplicaAon
Replica(on • Focuses on the validity of the
scien/fic claim • “Is this claim true?” • The ulAmate standard for
strengthening scienAfic evidence • New invesAgators, data,
analyAcal methods, laboratories, instruments, etc.
• ParAcularly important in studies that can impact broad policy or regulatory decisions
Underlying Trends • Some studies cannot be replicated: No Ame, No money, Unique/opportunisAc
• Technology is increasing data collecAon throughput; data are more complex and high-‐dimensional
• ExisAng databases can be merged to become bigger databases (but data are used off-‐label)
• CompuAng power allows more sophisAcated analyses, even on “small” data
• Data science, big data, you name it • For every field “X” there is a “ComputaAonal X”
What’s Wrong with ReplicaAon?
• Some studies cannot be replicated • No Ame, decisions must be made now • No money • OpportunisAc / Unique
Air PolluAon and Health Research: A Perfect Storm
• EsAmaAng small health effects in the presence of much stronger signals
• Results inform substanAal policy decisions and affect many stakeholders
• EPA regulaAons can cost billions of dollars
• Complex staAsAcal methods are needed and subjected to intense scruAny
The Result?
• Even basic analyses are difficult to describe • Heavy computaAonal requirements are thrust upon people without adequate training in staAsAcs and compuAng
• Errors are more easily introduced into long analysis pipelines
• Knowledge transfer is inhibited • Results are difficult to replicate or reproduce
Complicated analyses cannot be trusted
Recent Developments
“DecepAon at Duke”
The Duke Saga
“Rock Star” StaAsAcians
InsAtute Of Medicine Report
The IOM Report
In the Discovery/Test ValidaAon stage of omics-‐based tests:
• Data/metadata used to develop test should be made publicly available
• The computer code and fully specified computaAonal procedures used for development of the candidate omics-‐based test should be made sustainably available
• “Ideally, the computer code that is released will encompass all of the steps of computa(onal analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”
What is Reproducible Research?
Published ArAcle
Protocol
ScienAfic QuesAon
Nature
Author
Reader
Express train to nature
What is Reproducible Research?
Measured Data
AnalyAc Data
ComputaAonal Results Tables
Figures
Numerical Summaries Text
Author
Reader
Processing code AnalyAc code
PresentaAon code
Protocol
Nature
Published ArAcle
ScienAfic QuesAon
What’s Next for Reproducibility?
• Reproducibility is criAcal for communica/ng a data analysis
• One cannot sufficiently describe an analysis in journal pages or supplementary materials
• General consensus about its importance • No credible plan (yet) for how to implement such a requirement (hint: this is what’s next)
ReplicaAon and Reproducibility
Replica(on • Focuses on the validity of the
scien/fic claim • “Is this claim true?” • The ulAmate standard for
strengthening scienAfic evidence • New invesAgators, data,
analyAcal methods, laboratories, instruments, etc.
• ParAcularly important in studies that can impact broad policy or regulatory decisions
Reproducibility • Presumably focuses on the
validity of the data analysis • “Can we trust this analysis?” • Arguably a minimum
standard for any scienAfic study
• New invesAgators, same data, same methods
• Important when replicaAon is impossible
What Problem Does Reproducibility Solve?
What we get • Transparency • Data Availability • Sodware / Methods
Availability • Improved Transfer of
Knowledge
What we do not get • Validity / Correctness of the
analysis
An analysis can be reproducible and s/ll be wrong
Does requiring reproducibility deter bad analysis?
We want to know “can we trust this analysis?”
Problems with Reproducibility
The premise of reproducible research is that with data/code available, people can check each other and the whole system is self-‐correcAng • Addresses the most “downstream” aspect of the research process – post-‐publicaAon
• Assumes everyone is capable of doing analysis and wants to achieve the same goals (i.e. scienAfic discovery, search for truth)
An Analogy from Allergic Asthma
Allergen Exposure
Specific IgE + sensiAzed mast
cells
Inflammatory mediators (histamine, leukotrienes)
VasodilaAon, mucous (symptoms), bronchoconstricAon
Environmental IntervenAon AnA-‐IgE AnA-‐
inflammatory
MedicaAon
AnA-‐mediator
ScienAfic DisseminaAon Process
Research conducted
(problemaAc?)
Paper submiied to
journal
Paper publicaAon
Post-‐publicaAon review
Editor’s judgment
“MedicaAon”
Peer review Reproducible research
ScienAfic DisseminaAon Process
Research conducted
(problemaAc?)
Paper submiied to
journal
Paper publicaAon
Post-‐publicaAon review
??? Editor’s judgment
“MedicaAon”
Peer review Reproducible research
Biostatistics (2009), 10, 3, pp. 409–423doi:10.1093/biostatistics/kxp010Advance Access publication on April 17, 2009
Air pollution and health in Scotland: a multicity study
DUNCAN LEE∗, CLAIRE FERGUSON
Department of Statistics, University of Glasgow, Glasgow, G12 8QQ [email protected]
RICHARD MITCHELL
Public Health and Health Policy, University of Glasgow, Glasgow, G12 8QQ UK
SUMMARY
This paper presents an epidemiological study investigating the effects of long-term air pollution exposureon public health in Scotland, focusing on the 4 major urban areas, Aberdeen, Dundee, Edinburgh, andGlasgow. In particular, the associations between respiratory hospital admissions in 2005 and exposure toboth PM10 and NO2 between 2002 and 2004 are estimated using a small-area ecological design. The im-plementation of such studies requires careful consideration of a number of statistical issues, including howto model spatial correlation, identifiability of the model parameters, and the possible effects of ecologicalbias. The results show that long-term exposures (over 3 years) to PM10 and NO2 are significantly associ-ated with respiratory hospital admissions in Edinburgh and Glasgow, whereas the risks for Aberdeen andDundee are generally positive but nonsignificant.
Keywords: Air pollution and health; Bayesian spatial modeling; Ecological bias.
1. INTRODUCTION
The adverse effects on health associated with air pollution exposure came to public prominence in themid 1900s as a result of high air-pollution episodes that caused large numbers of excess deaths (Ministryof Public Health, 1954; Firket, 1936). Since then, numerous studies investigating both the short- and thelong-term effects of various constituents of air pollution have been conducted, with the majority findingpositive associations between exposure and both mortality and morbidity events. Short-term studies aretypically based on an ecological (at the population level) time series design, in which counts of mortalityor morbidity events on a given day are related to pollution exposures on the preceding few days. Examplesinclude the National Morbidity, Mortality, and Air Pollution Study (Dominici and others, 2002) and AirPollution and Health: a European Approach (Katsouyanni and others, 2001), which are multicity studiesbased on America and Europe, respectively.
Long-term studies estimate the cumulative effects on health of exposure over a number of years and arebased on either individual or ecological designs. Individual-level studies relate to a large cohort of peoplewhose health status is periodically assessed and subsequently related to ambient pollution concentrations
∗To whom correspondence should be addressed.
c⃝ The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].
by guest on January 5, 2011biostatistics.oxfordjournals.org
Dow
nloaded from
Reproducible Research at Biosta/s/cs Biostatistics (2009), 10, 4, pp. 756–772doi:10.1093/biostatistics/kxp029Advance Access publication on July 27, 2009
Second-order estimating equations for the analysis ofclustered current status data
RICHARD J. COOK∗, DAVID TOLUSSO
Department of Statistics and Actuarial Science, University of Waterloo,Waterloo, ON, Canada N2L 3G1
SUMMARY
With clustered event time data, interest most often lies in marginal features such as quantiles or proba-bilities from the marginal event time distribution or covariate effects on marginal hazard functions. Cop-ula models offer a convenient framework for modeling. We present methods of estimating the baselinemarginal distributions, covariate effects, and association parameters for clustered current status data basedon second-order generalized estimating equations. We examine the efficiency gains realized from usingsecond-order estimating equations compared with first-order equations, issues of copula misspecification,and apply the methods to motivating studies including one on the incidence of joint damage in patientswith psoriatic arthritis.
Keywords: Current status data; Generalized estimating equations; Piecewise constant hazards; Relative efficiency.
1. INTRODUCTION
Current status data, also referred to as type I interval censored data, arise when the status of a subjectis only known at a single point in time, and hence, the underlying failure time is either left or rightcensored. Ayer and others (1955) show how to obtain the nonparametric maximum likelihood estimateof the distribution function with current status data using techniques from isotonic regression (Barlowand others, 1972). Methods for fitting multiplicative semiparametric models are described by Xuand others (2004) via sieve estimation, and generalized additive models are fitted using isotonic regres-sion in Shiboski (1998). Sun (2006) provides a comprehensive summary of modern statistical methods forcurrent status data.
Several authors have investigated estimation of joint models for bivariate current status data. In thespirit of Shih and Louis (1995), Wang and Ding (2000) describe a 2-stage semiparametric estimationprocedure for bivariate current status data under a Clayton copula and assess the asymptotic and empiricalproperties of the resulting estimator of Kendall’s τ. van der Laan and Jewell (2002) discuss identifiabil-ity issues for bivariate current status data and develop an expectation-maximization algorithm that canbe used to estimate associated marginal features for constructing tests of independence. Extensions ofthis work appear in Jewell and others (2005) where other topics including goodness of fit of the copula
∗To whom correspondence should be addressed.
c⃝ The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].
by guest on January 5, 2011biostatistics.oxfordjournals.org
Dow
nloaded from
Who Reproduces Research?
• Someone needs to do something – Re-‐run the analysis; check results match – Check the code for bugs/errors – Try alternate approaches; check sensiAvity
• The need for someone to do something is inherited from tradiAonal noAon of replicaAon
• Who is “someone” and what are their goals?
Who Reproduces Research?
The truth is A
I don’t care
The truth is B
The truth is not A
Original InvesAgator Reproducers
The truth is A ScienAsts
General Public
???
The Story So Far
• Reproducibility brings transparency, improves communicaAon, and speeds knowledge transfer
• A lot of discussion about how to get people to share data
• Key quesAon of “can we trust this analysis?” is not addressed by reproducibility
• Reproducibility addresses potenAal problems long ader they’ve occurred (“downstream”)
• Secondary analyses are inevitably colored by the interests/moAvaAons of others
Evidence-‐based Data Analysis
• Most data analyses involve stringing together many different tools and methods
• Some methods may be standard for a given field, but others are oden applied ad hoc
• We should apply thoroughly studied (via staAsAcal research), mutually agreed upon methods to analyze data whenever possible
• There should be evidence to jusAfy the applicaAon of a given method
IncorporaAng StaAsAcal ExperAse Into Sodware (1991 ediAon)
“Throughout American or even global industry, there is much advocacy of staAsAcal process control and of understanding processes. Sta(s(cians have a process they espouse but do not know anything about. It is the process of pulng together many Any pieces, the process called data analysis, and is not really understood.” -‐-‐Daryl Pregibon, NRC Report 1991
Consider the Following Problem
You have a sample of 300 observaAons on a conAnuous morbidity outcome Y and a predictor X that is conAnuous biomarker. Describe a process to determine the magnitude of the linear associa(on between X and Y
Evidence-‐based Histogram Bin Width
Histogram of x
x
Frequency
−2 −1 0 1 2 3
05
1015
20
Sturges HA (1926), JASA Scoi DW (1979), Biometrika
What is Data Analysis?
Raw Data Cleaning / ValidaAon Pre-‐processing Exploratory
data analysis
StaAsAcal model development
SensiAvity analysis
Finalize results / report
StaAsAcs!
Evidence-‐based Data Analysis
• Create analyAc pipelines from evidence-‐based components – standardize it
• A Determinis/c Sta/s/cal Machine hip://goo.gl/Qvlhuv
• Once an evidence-‐based analyAc pipeline is established, we shouldn’t mess with it – Analysis with a “transparent box”
• Reduce the “researcher degrees of freedom” • Analogous to a pre-‐specified clinical trial protocol
DeterminisAc StaAsAcal Machine
Dataset
Input metadata
Preprocessing
Model Filng
SensiAvity analysis
Report
Output parameters
Public repository
Benchmark dataset
Benchmark dataset
Benchmark dataset
Methods secAon
Case Study: EsAmaAng Acute Effects of Ambient Air PolluAon Exposure
• Acute/short-‐term effects typically esAmated via panel studies or Ame series studies
• Work originated in late 1970s early 1980s • Key quesAon: “Are short-‐term changes in polluAon associated with short-‐term changes in a populaAon health outcome?”
• Studies usually conducted at community level • Long history of staAsAcal research invesAgaAng proper methods of analysis
New York Data
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●●●●●
●
●
●
●●
●
●●●
●
●●
●●
●
●
●
●
●●●
●
●
●
●
●
●●●
●●●
●
●
●
●●
●
●
●●●
●●●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●●●
●●
●
●
●
●
●
●●
●●
●
●
●
●●●●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●●●
●●
●
●●●●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●●●
●
●
●●●
●
●●
●
●
●
●
●●●●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●●●
●●●●
●
●●●
●
●
●
●●●
●
●●
●
●
●
●
●●
●
●●●
●
●
●
●
●
●●●●
●
●
●●
●●
●
●
●
●
●
●●
●●
●
●●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●●
●●
●●
●●●●●
●●
●
●
●
●●
●
●
●
●
●●●●
●●
●●
●●
●
●
●
●
●
●
●●
●●●
●
●●
●●
●●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●●
●●
●
●●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●●●
●
●
●●
●
●●
●●
●
●●
●
●●●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●●●
●
●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●●●
●
●
●●●
●
●
●
●
●
●●●●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●●
●●●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●●
●●
●
●●
●
●
●
●
●●●
●●
●
●
●●
●●
●
●●●
●●●●
●●
●●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●●●●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●●●●
●
●●●
●●
●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●●●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●●
●●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●●●
●
●
●
●
●●
●●
●
●●
2002 2003 2004 2005
120
160
200
Mor
talit
y co
unt
●●●●●●
●
●
●●●●
●
●
●
●●
●●●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●●●
●●●●
●
●●
●●
●●
●
●
●
●
●
●●
●
●●●●●●●●
●
●●●●●●
●
●
●
●●●
●
●●
●●
●
●
●
●
●●
●●
●
●●
●
●
●
●●
●●●●●●●●●
●●
●
●
●
●
●●
●
●
●
●●
●
●●●
●
●
●
●
●●●●
●●
●●●
●
●
●●
●●
●
●
●
●
●●
●
●
●●●●
●
●●●●
●
●●●●●●
●
●
●
●
●
●●●●●
●
2002 2003 2004 2005
020
4060
PM10
Case Study: EsAmaAng Acute Effects of Ambient Air PolluAon Exposure
• Can we encode everything that we have found in staAsAcal/epidemiological research into a single package?
• Time series studies do not have a huge range of variaAon; typically involves similar types of data and similar quesAons
• We can create a determinisAc staAsAcal machine for this area?
DSM Modules for Time Series Studies of Air PolluAon and Health
1. Check for outliers, high leverage, overdispersion 2. Fill in missing data? NO! 3. Model selecAon: EsAmate degrees of freedom
to adjust for unmeasured confounders – Other aspects of model not as criAcal
4. MulAple lag analysis 5. SensiAvity analysis wrt – Unmeasured confounder adjustment – InfluenAal points
Dominici, McDermoi, HasAe (2004) JASA; Peng, Dominici, Louis (2006) JRSS-‐A
DSM Output
Where to Go From Here?
• One DSM is not enough, we need many! • Different problems warrant different approaches and experAse
• A curated library of machines providing state-‐of-‐the art analysis pipelines
• A CRAN/CPAN/CTAN/… for data analysis • Or a “Cochrane CollaboraAon” for data analysis
Cochrane CollaboraAon
Cochrane CollaboraAon
A Curated Library of Data Analysis
• Provide packages that encode data analysis pipelines for given problems, technologies, quesAons
• Curated by experts knowledgeable in the field • DocumentaAon/references given supporAng each module in the pipeline
• Changes introduced ader passing relevant benchmarks/unit tests
Summary
• Reproducible research is important, but does not necessarily solve the criAcal quesAon of whether a data analysis is trustworthy
• Reproducible research focuses on the most “downstream” aspect of research disseminaAon
• Evidence-‐based data analysis would provide standardized, best pracAces for given scienAfic areas and quesAons
• Gives reviewers an important tool without dramaAcally increasing the burden on them
• More effort should be put into improving the quality of “upstream” aspects of scienAfic research
Coming This Fall! G U E S T E D I T O R ’ S I N T R O D U C T I O N
COMPUTING IN SCIENCE & ENGINEERING 11
Reproducible Research: Tools and Strategies for Scientific Computing
T his special issue is comprised of ar-ticles contributed by participants in a workshop held in Vancouver in July 2011 called “Reproducible Research:
Tools and Strategies for Scientific Computing.” The workshop was co-organized by Randall J. LeVeque (the Founders’ Term Professor of Ap-plied Mathematics at the University of Wash-ington), Ian M. Mitchell (an associate professor of computer science at the University of British Columbia), and myself. One of our aims was to improve the visibility of the nascent group of tool
builders working to facilitate really reproducible research in computational science.
This special issue focuses on tools and strate-gies for reproducible computational science and includes seven articles, each describing software development efforts. It begins with an introduc-tory article by the three workshop co-organizers
1521-9615/12/$31.00 © 2012 IEEECOPUBLISHED BY THE IEEE CS AND THE AIP
Victoria StoddenColumbia University
CISE-14-4-Gei.indd 11 6/13/12 2:39 PM
Johns Hopkins Data Science
Johns Hopkins Data Science
• All online, all the Ame • Probability/math stat, staAsAcal inference • Gelng + cleaning data • R programming • Regression modeling • PredicAon + machine learning • Exploratory data analysis • Reproducible research • Data products • Capstone project with industry partners