Vivien Bonazzi, Ph.D.
Program Director: Computational Biology (NHGRI)
Co-Chair, Software Methods & Systems (BD2K)
Biomedical Big Data Initiative (BD2K)
Slide 2
Myriad Data Types
Other Omic, Imaging, Phenotypic, Clinical, Genomic, Exposure
Slide 3
Data and Informatics Working Group
acd.od.nih.gov/diwg.htm
Slide 4
What Are the Big Problems to Solve?
1. Locating the data
2. Getting access to the data
3. Extending policies and practices for data sharing
4. Organizing, managing, and processing biomedical Big Data
5. Developing new methods for analyzing biomedical Big Data
6. Training researchers who can use biomedical Big Data effectively
Slide 5
Overarching Strategy and Goals
Two initiatives being proposed to overcome roadblocks:
Big Data to Knowledge (BD2K): enable the biomedical research enterprise to maximize the value of biomedical data
InfrastructurePlus: create an adaptive environment at NIH to sustain world-class biomedical research
Slide 6
Big Data to Knowledge (BD2K): Overview
Major trans-NIH initiative addressing an NIH imperative and key roadblock
Aims to be catalytic and synergistic
Overarching goal: By the end of this decade, enable a quantum leap in the ability of the biomedical research enterprise to maximize the value of the growing volume and complexity of biomedical data
Slide 7
BD2K: Four Programmatic Areas
I. Facilitating Broad Use of Biomedical Big Data
II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data
III. Enhancing Training for Biomedical Big Data
IV. Establishing Centers of Excellence for Biomedical Big Data
Slide 8
Area 1: Data Sharing & Access
A. Policies to Facilitate Data Sharing
B. Data Catalog: Data Discovery, Citation, Links to Literature
C. Frameworks for Community-Based Solutions to Developing Data Standards
D. Enabling Research Use of Clinical Data
Facilitating usage and sharing of biomedical big data
New Policies to Encourage Data & Software Sharing
Index of Research Datasets to Facilitate Data Location & Citation
Community-based Development of Data & Metadata Standards
Slide 9
Area 2: Software and Systems Development
A. Grants for software development
B. Software Registry: Making biomedical software findable and citable
C. Cloud computing: Facilitating Data Analysis
D. Dynamic Social Engagement via social media
Development of analysis methods and software
Software to Meet Needs of the Biomedical Research Community
Facilitating Data Analysis: Access to Large-scale Computing
Dynamic Community Engagement of Users and Developers
Slide 10
Software Grants
Current and emerging needs for using, managing, and analyzing the larger and more complex data sets inherent to biomedical Big Data
Compression/Reduction
Visualization
Provenance
Data Wrangling
Area 2: Software and Systems Development
Slide 11
Big Data needs Big Computing
Cloud Computing
Leveraging the cloud
Storing and analyzing huge data sets
Collaborative environment
Developing appropriate policies for use of controlled access data in the cloud (dbGaP)
Developing working relationships with major cloud providers: AWS, Google, Microsoft (Azure)
HPC
More exploration with Supercomputing facilities
Area 2: Software and Systems Development
Slide 12
Area 3: Training
Enhancing computational training
Increase Number of Computationally Skilled Trainees
Strengthen the Quantitative Skills of All Researchers
Enhance NIH Review and Program Oversight
Slide 13
Area 4: Centers
A. Investigator-initiated Centers
B. NIH-specified Centers
Establishing centers of excellence
Collaborative environments & technologies
Data integration
Analysis & modeling methods
Computer science & statistical approaches
Slide 14
Big Data to Knowledge (BD2K) bd2k.nih.gov
Slide 15
Biomedical Research as Part of the Digital Enterprise
Philip E. Bourne, Ph.D.
Associate Director for Data Science
National Institutes of Health
Slide 16
Myriad Data Types
Other Omic, Imaging, Phenotypic, Clinical, Genomic, Exposure
Slide 17
Myriad Data Types
Other Omic, Imaging, Phenotypic, Clinical, Genomic, Exposure
Slide 18
Components of The Academic Digital Enterprise
Consists of digital assets
E.g. datasets, papers, software, lab notes
Each asset is uniquely identified and has provenance, including access control
E.g. publishing simply involves changing the access control
Digital assets are interoperable across the enterprise
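The asset model on this slide can be illustrated with a minimal sketch. Everything named here (`DigitalAsset`, `Access`, `publish`, the two access levels, the wording of the provenance entry) is a hypothetical illustration and not part of any NIH or BD2K system; it only assumes what the slide states: each asset has a unique identifier, provenance, and access control, and publishing amounts to changing the access control.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class Access(Enum):
    """Hypothetical access levels; a real enterprise would likely need more."""
    PRIVATE = "private"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    kind: str   # e.g. "dataset", "paper", "software", "lab notes"
    title: str
    access: Access = Access.PRIVATE
    # Unique identifier assigned at creation (a random UUID is an assumption).
    asset_id: str = field(default_factory=lambda: uuid4().hex)
    # Provenance kept as a simple append-only list of event descriptions.
    provenance: list = field(default_factory=list)

    def publish(self) -> None:
        """Publishing is only an access-control change, recorded in provenance."""
        self.provenance.append(f"access changed: {self.access.value} -> public")
        self.access = Access.PUBLIC


notes = DigitalAsset(kind="lab notes", title="Assay run 12")
notes.publish()
```

In this sketch the asset itself never moves or changes form when it is published; only its `access` field and provenance record do, which is the point the slide makes.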
Slide 19
Let's Break Down the Silos
New policies, regulations, e.g. data sharing
Economic drivers
The promise of shared data
Slide 20
The NIH is Starting to Think About the Digital Enterprise
Big Data to Knowledge (BD2K)
bd2k.nih.gov
Slide 21
This is great, but BD2K is just a start. What will the end
product look like?
Slide 22
To get to that end point we have to consider the complete
research lifecycle
Slide 23
The Research Life Cycle will Persist
IDEAS - HYPOTHESES - EXPERIMENTS - DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Slide 24
Tools and Resources Will Continue To Be Developed
IDEAS - HYPOTHESES - EXPERIMENTS - DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
Slide 25
Those Elements of the Research Life Cycle will Become More Interconnected Around a Common Framework
IDEAS - HYPOTHESES - EXPERIMENTS - DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
Slide 26
New/Extended Support Structures Will Emerge
IDEAS - HYPOTHESES - EXPERIMENTS - DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
Commercial & Public Tools
Git-like Resources By Discipline
Data Journals
Discipline-Based Metadata Standards
Community Portals
Institutional Repositories
New Reward Systems
Commercial Repositories
Training