
Reproducible Research with Evidence-based Data Analysis

Roger D. Peng, PhD
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health
@rdpeng, @simplystats, simplystatistics.org

April 2014

Replication, Replication

Replication
•  Focuses on the validity of the scientific claim
•  “Is this claim true?”
•  The ultimate standard for strengthening scientific evidence
•  New investigators, data, analytical methods, laboratories, instruments, etc.
•  Particularly important in studies that can impact broad policy or regulatory decisions

Underlying Trends

•  Some studies cannot be replicated: no time, no money, unique/opportunistic
•  Technology is increasing data collection throughput; data are more complex and high-dimensional
•  Existing databases can be merged to become bigger databases (but data are used off-label)
•  Computing power allows more sophisticated analyses, even on “small” data
•  Data science, big data, you name it
•  For every field “X” there is a “Computational X”

What’s Wrong with Replication?

•  Some studies cannot be replicated
•  No time, decisions must be made now
•  No money
•  Opportunistic / Unique

Air Pollution and Health Research: A Perfect Storm

•  Estimating small health effects in the presence of much stronger signals
•  Results inform substantial policy decisions and affect many stakeholders
•  EPA regulations can cost billions of dollars
•  Complex statistical methods are needed and subjected to intense scrutiny

The Result?

•  Even basic analyses are difficult to describe
•  Heavy computational requirements are thrust upon people without adequate training in statistics and computing
•  Errors are more easily introduced into long analysis pipelines
•  Knowledge transfer is inhibited
•  Results are difficult to replicate or reproduce

Complicated analyses cannot be trusted

Recent Developments

“Deception at Duke”

The Duke Saga

“Rock Star” Statisticians

Institute of Medicine Report

The IOM Report

In the Discovery/Test Validation stage of omics-based tests:

•  Data/metadata used to develop the test should be made publicly available
•  The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
•  “Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”

What is Reproducible Research?

[Diagram: Scientific Question, Protocol, Nature, and Published Article, with Author and Reader at opposite ends; label: “Express train to nature”]

What is Reproducible Research?

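[Diagram: research pipeline from Measured Data through Analytic Data and Computational Results to the Figures, Tables, Numerical Summaries, and Text of the Published Article; the steps are linked by processing code, analytic code, and presentation code; Author and Reader sit at opposite ends; other labels: Scientific Question, Protocol, Nature]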

What’s Next for Reproducibility?

•  Reproducibility is critical for communicating a data analysis
•  One cannot sufficiently describe an analysis in journal pages or supplementary materials
•  General consensus about its importance
•  No credible plan (yet) for how to implement such a requirement (hint: this is what’s next)

Replication and Reproducibility

Replication
•  Focuses on the validity of the scientific claim
•  “Is this claim true?”
•  The ultimate standard for strengthening scientific evidence
•  New investigators, data, analytical methods, laboratories, instruments, etc.
•  Particularly important in studies that can impact broad policy or regulatory decisions

Reproducibility
•  Presumably focuses on the validity of the data analysis
•  “Can we trust this analysis?”
•  Arguably a minimum standard for any scientific study
•  New investigators, same data, same methods
•  Important when replication is impossible

What Problem Does Reproducibility Solve?

What we get
•  Transparency
•  Data Availability
•  Software / Methods Availability
•  Improved Transfer of Knowledge

What we do not get
•  Validity / Correctness of the analysis

An analysis can be reproducible and still be wrong

Does requiring reproducibility deter bad analysis?

We want to know “can we trust this analysis?”

Problems with Reproducibility

The premise of reproducible research is that with data/code available, people can check each other and the whole system is self-correcting

•  Addresses the most “downstream” aspect of the research process – post-publication
•  Assumes everyone is capable of doing analysis and wants to achieve the same goals (i.e. scientific discovery, search for truth)

An Analogy from Allergic Asthma

[Diagram: Allergen Exposure → Specific IgE + sensitized mast cells → Inflammatory mediators (histamine, leukotrienes) → Vasodilation, mucous, bronchoconstriction (symptoms). Interventions labeled along the pathway: Environmental Intervention, Anti-IgE, Anti-mediator, Anti-inflammatory, Medication]

Scientific Dissemination Process

[Diagram: Research conducted (problematic?) → Paper submitted to journal → Paper publication → Post-publication review. “Medication” labels: Editor’s judgment, Peer review, Reproducible research]

Scientific Dissemination Process

[Diagram: Research conducted (problematic?) → Paper submitted to journal → Paper publication → Post-publication review. “Medication” labels: ???, Editor’s judgment, Peer review, Reproducible research]

Biostatistics (2009), 10, 3, pp. 409–423
doi:10.1093/biostatistics/kxp010
Advance Access publication on April 17, 2009

Air pollution and health in Scotland: a multicity study

DUNCAN LEE∗, CLAIRE FERGUSON

Department of Statistics, University of Glasgow, Glasgow, G12 8QQ [email protected]

RICHARD MITCHELL

Public Health and Health Policy, University of Glasgow, Glasgow, G12 8QQ UK

SUMMARY

This paper presents an epidemiological study investigating the effects of long-term air pollution exposure on public health in Scotland, focusing on the 4 major urban areas, Aberdeen, Dundee, Edinburgh, and Glasgow. In particular, the associations between respiratory hospital admissions in 2005 and exposure to both PM10 and NO2 between 2002 and 2004 are estimated using a small-area ecological design. The implementation of such studies requires careful consideration of a number of statistical issues, including how to model spatial correlation, identifiability of the model parameters, and the possible effects of ecological bias. The results show that long-term exposures (over 3 years) to PM10 and NO2 are significantly associated with respiratory hospital admissions in Edinburgh and Glasgow, whereas the risks for Aberdeen and Dundee are generally positive but nonsignificant.

Keywords: Air pollution and health; Bayesian spatial modeling; Ecological bias.

1. INTRODUCTION

The adverse effects on health associated with air pollution exposure came to public prominence in the mid 1900s as a result of high air-pollution episodes that caused large numbers of excess deaths (Ministry of Public Health, 1954; Firket, 1936). Since then, numerous studies investigating both the short- and the long-term effects of various constituents of air pollution have been conducted, with the majority finding positive associations between exposure and both mortality and morbidity events. Short-term studies are typically based on an ecological (at the population level) time series design, in which counts of mortality or morbidity events on a given day are related to pollution exposures on the preceding few days. Examples include the National Morbidity, Mortality, and Air Pollution Study (Dominici and others, 2002) and Air Pollution and Health: a European Approach (Katsouyanni and others, 2001), which are multicity studies based on America and Europe, respectively.

Long-term studies estimate the cumulative effects on health of exposure over a number of years and are based on either individual or ecological designs. Individual-level studies relate to a large cohort of people whose health status is periodically assessed and subsequently related to ambient pollution concentrations

∗To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].

Reproducible Research at Biostatistics

Biostatistics (2009), 10, 4, pp. 756–772
doi:10.1093/biostatistics/kxp029
Advance Access publication on July 27, 2009

Second-order estimating equations for the analysis of clustered current status data

RICHARD J. COOK∗, DAVID TOLUSSO

Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada N2L 3G1

[email protected]

SUMMARY

With clustered event time data, interest most often lies in marginal features such as quantiles or probabilities from the marginal event time distribution or covariate effects on marginal hazard functions. Copula models offer a convenient framework for modeling. We present methods of estimating the baseline marginal distributions, covariate effects, and association parameters for clustered current status data based on second-order generalized estimating equations. We examine the efficiency gains realized from using second-order estimating equations compared with first-order equations, issues of copula misspecification, and apply the methods to motivating studies including one on the incidence of joint damage in patients with psoriatic arthritis.

Keywords: Current status data; Generalized estimating equations; Piecewise constant hazards; Relative efficiency.

1. INTRODUCTION

Current status data, also referred to as type I interval censored data, arise when the status of a subject is only known at a single point in time, and hence, the underlying failure time is either left or right censored. Ayer and others (1955) show how to obtain the nonparametric maximum likelihood estimate of the distribution function with current status data using techniques from isotonic regression (Barlow and others, 1972). Methods for fitting multiplicative semiparametric models are described by Xu and others (2004) via sieve estimation, and generalized additive models are fitted using isotonic regression in Shiboski (1998). Sun (2006) provides a comprehensive summary of modern statistical methods for current status data.

Several authors have investigated estimation of joint models for bivariate current status data. In the spirit of Shih and Louis (1995), Wang and Ding (2000) describe a 2-stage semiparametric estimation procedure for bivariate current status data under a Clayton copula and assess the asymptotic and empirical properties of the resulting estimator of Kendall’s τ. van der Laan and Jewell (2002) discuss identifiability issues for bivariate current status data and develop an expectation-maximization algorithm that can be used to estimate associated marginal features for constructing tests of independence. Extensions of this work appear in Jewell and others (2005) where other topics including goodness of fit of the copula

∗To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].

Who Reproduces Research?

•  Someone needs to do something
   –  Re-run the analysis; check results match
   –  Check the code for bugs/errors
   –  Try alternate approaches; check sensitivity
•  The need for someone to do something is inherited from the traditional notion of replication
•  Who is “someone” and what are their goals?
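The "re-run the analysis; check results match" step can be sketched in a few lines. This is only an illustration: the function name `results_match`, the tolerance, and the numbers in `published` and `reproduced` are all hypothetical and do not come from any study mentioned in these slides.

```python
import math


def results_match(published, reproduced, rel_tol=1e-6):
    """Compare each published quantity with its reproduced value.

    Returns a dict mapping the quantity name to True (values agree within
    rel_tol) or False (values differ, or the quantity was not reproduced).
    """
    report = {}
    for name, value in published.items():
        if name not in reproduced:
            report[name] = False
            continue
        report[name] = math.isclose(value, reproduced[name], rel_tol=rel_tol)
    return report


# Hypothetical numbers, for illustration only.
published = {"slope": 0.4213, "std_error": 0.0587}
reproduced = {"slope": 0.4213000001, "std_error": 0.0612}

report = results_match(published, reproduced)
for name, ok in report.items():
    print(f"{name}: {'match' if ok else 'MISMATCH'}")
```

A check like this only says whether the numbers agree. It does not say whether the analysis that produced them was a sound one.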

Who Reproduces Research?

[Diagram: the Original Investigator (“The truth is A”) and the possible Reproducers: Scientists, the General Public, and “???”. Positions shown: “The truth is A”, “The truth is B”, “The truth is not A”, “I don’t care”]

The Story So Far

•  Reproducibility brings transparency, improves communication, and speeds knowledge transfer
•  A lot of discussion about how to get people to share data
•  Key question of “can we trust this analysis?” is not addressed by reproducibility
•  Reproducibility addresses potential problems long after they’ve occurred (“downstream”)
•  Secondary analyses are inevitably colored by the interests/motivations of others

Evidence-based Data Analysis

•  Most data analyses involve stringing together many different tools and methods
•  Some methods may be standard for a given field, but others are often applied ad hoc
•  We should apply thoroughly studied (via statistical research), mutually agreed upon methods to analyze data whenever possible
•  There should be evidence to justify the application of a given method

Incorporating Statistical Expertise Into Software (1991 edition)

“Throughout American or even global industry, there is much advocacy of statistical process control and of understanding processes. Statisticians have a process they espouse but do not know anything about. It is the process of putting together many tiny pieces, the process called data analysis, and is not really understood.” --Daryl Pregibon, NRC Report 1991

Consider the Following Problem

You have a sample of 300 observations on a continuous morbidity outcome Y and a predictor X that is a continuous biomarker. Describe a process to determine the magnitude of the linear association between X and Y.
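One possible process, among many an analyst might choose, is an ordinary least-squares fit of Y on X that reports the slope with a standard error. The sketch below is an assumption-laden illustration, not the answer the slide has in mind: the data are simulated (true slope 0.5, chosen arbitrarily), the function name `linear_association` is hypothetical, and the interval uses a normal approximation (±1.96 standard errors).

```python
import math
import random


def linear_association(x, y):
    """Least-squares fit of y on x; returns slope, intercept, SE, and 95% CI."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the usual standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    # Normal-approximation interval (an assumption of this sketch).
    ci95 = (slope - 1.96 * se, slope + 1.96 * se)
    return {"slope": slope, "intercept": intercept, "se": se, "ci95": ci95}


# Simulated stand-in for the 300 observations; not real data.
random.seed(2014)
x = [random.gauss(0, 1) for _ in range(300)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

fit = linear_association(x, y)
print(f"slope = {fit['slope']:.3f}, SE = {fit['se']:.3f}")
```

Even this short version makes choices silently: no outlier checks, no transformation, and a particular interval. Those undeclared choices are the kind of thing the question is meant to surface.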

Evidence-based Histogram Bin Width

[Figure: “Histogram of x”; x-axis: x (−2 to 3); y-axis: Frequency (0 to 20)]

Sturges HA (1926), JASA; Scott DW (1979), Biometrika
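The two cited rules are short enough to write down. A minimal sketch, assuming the common textbook forms: Sturges' rule as ceil(log2 n) + 1 bins, and Scott's rule as a bin width of 3.49 · s · n^(−1/3), where s is the sample standard deviation. The function names and the simulated sample are illustrative only.

```python
import math
import random
import statistics


def sturges_bins(n):
    """Sturges' rule: number of bins for a sample of size n."""
    return math.ceil(math.log2(n)) + 1


def scott_width(x):
    """Scott's rule: bin width 3.49 * s * n^(-1/3), s = sample std. dev."""
    return 3.49 * statistics.stdev(x) * len(x) ** (-1 / 3)


# Simulated sample, for illustration only.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]

print("Sturges bins:", sturges_bins(len(x)))
print("Scott width :", round(scott_width(x), 3))
```

The point of the slide survives in the code. Each rule is a default that was studied and published, so an analyst can cite it rather than pick a bin width by eye.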

What is Data Analysis?

[Diagram: Raw Data → Cleaning / Validation → Pre-processing → Exploratory data analysis → Statistical model development → Sensitivity analysis → Finalize results / report; label: “Statistics!”]

Evidence-based Data Analysis

•  Create analytic pipelines from evidence-based components
   –  Standardize it
•  A Deterministic Statistical Machine http://goo.gl/Qvlhuv
•  Once an evidence-based analytic pipeline is established, we shouldn’t mess with it
   –  Analysis with a “transparent box”
•  Reduce the “researcher degrees of freedom”
•  Analogous to a pre-specified clinical trial protocol

Deterministic Statistical Machine

[Diagram: Dataset and Input metadata feed a fixed pipeline of Preprocessing → Model Fitting → Sensitivity analysis → Report. Other labels: Output parameters, Public repository, Benchmark dataset (three instances), Methods section]
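A toy version of such a machine might look like the sketch below. Everything in it is an assumption made for illustration: the stage functions, the metadata keys `predictor` and `outcome`, the choice of simple linear regression, and the leave-one-out sensitivity step are hypothetical stand-ins, not the design behind the slides. The idea it tries to show is that the caller supplies only a dataset and metadata, and every analytic choice is fixed inside the pipeline.

```python
import statistics


def preprocess(rows, metadata):
    """Keep complete cases only (no imputation); return (x, y) pairs."""
    x_key, y_key = metadata["predictor"], metadata["outcome"]
    return [(r[x_key], r[y_key]) for r in rows
            if r.get(x_key) is not None and r.get(y_key) is not None]


def fit_model(pairs):
    """Simple linear regression of y on x."""
    mean_x = statistics.fmean(x for x, _ in pairs)
    mean_y = statistics.fmean(y for _, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def sensitivity(pairs):
    """Refit with each observation left out; report the range of slopes."""
    slopes = [fit_model(pairs[:i] + pairs[i + 1:])["slope"]
              for i in range(len(pairs))]
    return {"min_slope": min(slopes), "max_slope": max(slopes)}


def run_dsm(rows, metadata):
    """Run the fixed pipeline; there are no tuning options for the caller."""
    pairs = preprocess(rows, metadata)
    return {
        "n_used": len(pairs),
        "estimate": fit_model(pairs),
        "sensitivity": sensitivity(pairs),
        "methods": ("Complete-case simple linear regression with "
                    "leave-one-out sensitivity analysis."),
    }


# Tiny made-up dataset; one row has a missing outcome.
rows = [{"x": 1, "y": 3}, {"x": 2, "y": 5}, {"x": 3, "y": None},
        {"x": 4, "y": 9}, {"x": 5, "y": 11}]
report = run_dsm(rows, {"predictor": "x", "outcome": "y"})
print(report["estimate"], report["sensitivity"])
```

Because the report carries its own methods text and sensitivity results, two analysts running the same machine on the same data would produce the same output.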

Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure

•  Acute/short-term effects typically estimated via panel studies or time series studies
•  Work originated in late 1970s early 1980s
•  Key question: “Are short-term changes in pollution associated with short-term changes in a population health outcome?”
•  Studies usually conducted at community level
•  Long history of statistical research investigating proper methods of analysis

New York Data

[Figure: daily time series for New York, 2002–2005, in two panels: mortality count (axis ticks 120 to 200) and PM10 (axis ticks 0 to 60)]

Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure

•  Can we encode everything that we have found in statistical/epidemiological research into a single package?
•  Time series studies do not have a huge range of variation; they typically involve similar types of data and similar questions
•  Can we create a deterministic statistical machine for this area?

DSM Modules for Time Series Studies of Air Pollution and Health

1.  Check for outliers, high leverage, overdispersion
2.  Fill in missing data? NO!
3.  Model selection: estimate degrees of freedom to adjust for unmeasured confounders
   –  Other aspects of model not as critical
4.  Multiple lag analysis
5.  Sensitivity analysis with respect to
   –  Unmeasured confounder adjustment
   –  Influential points

Dominici, McDermott, Hastie (2004) JASA; Peng, Dominici, Louis (2006) JRSS-A
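Two of the modules, "do not fill in missing data" and "multiple lag analysis", can be illustrated with a deliberately simplified sketch. The assumptions are substantial and worth stating plainly: a plain least-squares slope stands in for the time series regression models used in the cited papers, there is no adjustment for unmeasured confounders, and the series below are made up so that the outcome depends on the previous day's exposure. The function names are hypothetical.

```python
def lagged_pairs(exposure, outcome, lag):
    """Pair the outcome on day t with exposure on day t - lag.

    Days where either value is missing (None) are skipped, not imputed.
    """
    pairs = []
    for t in range(lag, len(outcome)):
        x, y = exposure[t - lag], outcome[t]
        if x is not None and y is not None:
            pairs.append((x, y))
    return pairs


def slope(pairs):
    """Least-squares slope of y on x (a stand-in for the real model)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx


def multiple_lag_analysis(exposure, outcome, lags=(0, 1, 2)):
    """Estimate the association separately at each lag."""
    return {lag: slope(lagged_pairs(exposure, outcome, lag)) for lag in lags}


# Made-up series: outcome[t] = 100 + 0.5 * exposure[t - 1] where defined.
exposure = [10, 20, None, 40, 30, 50, 20, 60]
outcome = [None, 105, 110, None, 120, 115, 125, 110]

by_lag = multiple_lag_analysis(exposure, outcome)
for lag, estimate in by_lag.items():
    print(f"lag {lag}: slope = {estimate:.3f}")
```

In this constructed example the lag-1 estimate recovers the slope of 0.5 built into the data, while the other lags do not.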

DSM Output

Where to Go From Here?

•  One DSM is not enough, we need many!
•  Different problems warrant different approaches and expertise
•  A curated library of machines providing state-of-the-art analysis pipelines
•  A CRAN/CPAN/CTAN/… for data analysis
•  Or a “Cochrane Collaboration” for data analysis

Cochrane Collaboration

A Curated Library of Data Analysis

•  Provide packages that encode data analysis pipelines for given problems, technologies, questions
•  Curated by experts knowledgeable in the field
•  Documentation/references given supporting each module in the pipeline
•  Changes introduced after passing relevant benchmarks/unit tests
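The last point, accepting a change only after it passes benchmarks, could work roughly as sketched below. This is a hypothetical illustration: the benchmark dataset, the expected value, and both candidate estimators are invented here, and a real curated library would use benchmark datasets agreed on by the field's experts.

```python
import math


def ols_slope(pairs):
    """Current pipeline: least-squares slope with an intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx


def slope_through_origin(pairs):
    """Proposed change: drops the intercept (wrong for this benchmark)."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


# A benchmark is a dataset with a known correct answer.
BENCHMARKS = [
    {"name": "exact line y = 1 + 2x",
     "data": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
     "expected_slope": 2.0},
]


def passes_benchmarks(pipeline, benchmarks=BENCHMARKS, rel_tol=1e-9):
    """Return True only if the pipeline reproduces every benchmark answer."""
    return all(
        math.isclose(pipeline(b["data"]), b["expected_slope"], rel_tol=rel_tol)
        for b in benchmarks
    )


accepted = passes_benchmarks(ols_slope)
rejected = passes_benchmarks(slope_through_origin)
print("current pipeline passes:", accepted)
print("proposed change passes :", rejected)
```

Here the proposed change is rejected because it gives the wrong slope on a dataset whose answer is known in advance.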

Summary

•  Reproducible research is important, but does not necessarily solve the critical question of whether a data analysis is trustworthy
•  Reproducible research focuses on the most “downstream” aspect of research dissemination
•  Evidence-based data analysis would provide standardized, best practices for given scientific areas and questions
•  Gives reviewers an important tool without dramatically increasing the burden on them
•  More effort should be put into improving the quality of “upstream” aspects of scientific research

Coming This Fall!

GUEST EDITOR’S INTRODUCTION

COMPUTING IN SCIENCE & ENGINEERING

Reproducible Research: Tools and Strategies for Scientific Computing

Victoria Stodden, Columbia University

This special issue is comprised of articles contributed by participants in a workshop held in Vancouver in July 2011 called “Reproducible Research: Tools and Strategies for Scientific Computing.” The workshop was co-organized by Randall J. LeVeque (the Founders’ Term Professor of Applied Mathematics at the University of Washington), Ian M. Mitchell (an associate professor of computer science at the University of British Columbia), and myself. One of our aims was to improve the visibility of the nascent group of tool builders working to facilitate really reproducible research in computational science.

This special issue focuses on tools and strategies for reproducible computational science and includes seven articles, each describing software development efforts. It begins with an introductory article by the three workshop co-organizers

1521-9615/12/$31.00 © 2012 IEEE. COPUBLISHED BY THE IEEE CS AND THE AIP

Johns Hopkins Data Science

•  All online, all the time
•  Probability/math stat, statistical inference
•  Getting + cleaning data
•  R programming
•  Regression modeling
•  Prediction + machine learning
•  Exploratory data analysis
•  Reproducible research
•  Data products
•  Capstone project with industry partners
